CS302 -- Lab 6


Lab Objective

This lab is designed to give you experience with implementing and using various sorting methods. Specifically, you will implement two variations of the sort-merge algorithm for external sorting--one that uses quicksort for internal sorting, and one that uses a priority queue for replacement sorting.


Setting Up

You should copy the following files from the /home/bvz/cs302/lab6 directory to your directory:

You can use the data files to test your programs.

A working version of sortmerge is also in this directory. If you have a question about what your program should do, first see what my program does. My version of sortmerge has an additional -d flag that makes the program pause after each merge phase so that you can examine my scratch files and see what they look like.


Problem Statement

In this lab you will implement the balanced, multi-way external sorting algorithm presented in class. You will also experiment with varying the size of the initial runs and the size of P to get an idea of how these parameters affect the performance of the algorithm. Here is a description of the program you are to write:

Name

sortmerge - sort and collate lines of a file.

Synopsis

sortmerge [ -p num_ways ] [ -r run_size ] [ -q ] input_file output_file

Description

sortmerge sorts the lines in input_file from smallest to largest and stores the sorted lines in output_file.

You may assume that each line in input_file has three integer fields. The first field is the sort key. An example data file may be found in data. If there is a tie between the sort keys it does not matter in which order the keys are placed.

Options

You should also allow the TAs to pause your program after each of your merge phases so that they can examine your scratch files. You should pause your program only after you have closed the scratch files from the previous merge phase, because the new output scratch files will appear empty until you have done so. The following procedure provides an easy way to put this pause into your program, and you should feel free to copy it into your program:
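#include <cstdio>

// Block until the user presses return (or input ends).
void waitForReturnKey() {
  int okToProceedChar; // int rather than char so that EOF can be told apart from real input
  printf("hit 'return' to continue.\n");
  while (true) {
    okToProceedChar = getchar();
    if (okToProceedChar == '\n' || okToProceedChar == EOF) break;
  }
}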

Examples

  1. Sort the contents of data and store the sorted file in output-data:

        sortmerge data output-data
    

  2. Sort the contents of data using a 10-way sort and store the sorted file in output-data:

        sortmerge -p 10 data output-data
    

  3. Sort the contents of data using a 4-way sort and an initial run size of 6. Store the sorted file in output-data:

        sortmerge -r 6 -p 4 data output-data
    

  4. Sort the contents of data using a 4-way sort and an initial run size of 6. Use replacement selection and store the sorted file in output-data:

        sortmerge -r 6 -p 4 -q data output-data
    


Output

The output of sortmerge is the sorted output file and a printed list of three statistics:

  1. Number of Reads: The total number of lines that you read during the sort and merge phases. This number should include the number of lines that you read from the input file plus the number of lines that you read from any of the scratch files.
  2. Number of Writes: The total number of lines that you wrote during the sort and merge phases. This number should include the number of lines that you wrote to the output file plus the number of lines that you wrote to any of the scratch files. This number should be the same as the number of reads.
  3. Number of Passes: The number of passes your program requires to sort the input. The number of passes should be equal to the number of merge phases you perform plus one for the pass required to sort the initial set of runs.


Error Checking

The only error checking that you need to do for this lab is to check the command line arguments and ensure that: 1) any flags are valid flags, 2) any numeric parameters are non-negative, and 3) both an input and output file are specified.
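
The three checks above can be sketched as a small parsing routine. This is only an illustration under stated assumptions: the Options struct, parseCount, and parseArgs are hypothetical names, the default values for P and the run size are placeholders rather than the lab's actual defaults, and -q is taken to select replacement selection because that is how Example 4 uses it.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical container for the parsed command line.
struct Options {
    long ways = 2;             // ASSUMED placeholder default for P
    long runSize = 100;        // ASSUMED placeholder default run size
    bool replacement = false;  // set by -q (as read from Example 4)
    std::string input, output;
};

// Parse a non-negative integer; return false on junk or a negative value.
bool parseCount(const std::string& s, long& out) {
    if (s.empty()) return false;
    char* end = nullptr;
    long v = std::strtol(s.c_str(), &end, 10);
    if (*end != '\0' || v < 0) return false;
    out = v;
    return true;
}

// args holds argv[1..argc-1]. Returns false if any of the three checks fails.
bool parseArgs(const std::vector<std::string>& args, Options& opt) {
    std::vector<std::string> files;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        if (a == "-q") {
            opt.replacement = true;
        } else if (a == "-p" || a == "-r") {
            if (i + 1 >= args.size()) return false;        // flag without a value
            long& target = (a == "-p") ? opt.ways : opt.runSize;
            if (!parseCount(args[++i], target)) return false;  // check 2
        } else if (!a.empty() && a[0] == '-') {
            return false;                                  // check 1: unknown flag
        } else {
            files.push_back(a);
        }
    }
    if (files.size() != 2) return false;                   // check 3
    opt.input = files[0];
    opt.output = files[1];
    return true;
}
```

A main function would build the vector from argv + 1 and print a usage message when parseArgs returns false.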


Strategy

Write this program using the pseudo-code provided in class and the various classes that we have provided for you.

I created the sortmerge program incrementally using the following steps:

  1. Write and test an input routine for reading data from the input file and distributing it onto P scratch output files using non-replacement sorting. This input routine repetitively reads data into an array equal to the run size, sorts the data using the C++ sort function (which is typically implemented as a quicksort variant), and then outputs the run to the appropriate output file. You should use the Fields package to read in a file and fopen to write out a file.

  2. Write and test a routine for merging one set of runs from the scratch input files into a new run and writing the new run to a scratch output file.

  3. Write and test the sort merge procedure. It will have to call the routine written in the previous step to create runs and it will have to manage the scratch files so that on each iteration, the previous input files become output files and the previous output files become input files.

  4. Write and test an input routine for reading data from the input file and distributing it onto P scratch output files using replacement sorting. This input routine repetitively reads data into a priority queue whose size equals the run size and then creates a run by removing records from the priority queue and outputting them to the appropriate output scratch file. Since a record's position in the priority queue depends on both the value of its key and whether it belongs in the current or the next run, the "key" for the priority queue is actually a combination of two elements. Therefore you will need to write your own comparison function that compares two records. The records that you store into the priority queue will need to have an additional field that indicates whether or not the record belongs in the current run or the next run.

  5. Write and test the main procedure so that it handles the -r, -p, and -q switches. In the initial phases of writing the program, I used the default values for the initial run sizes and the value of P. I alternately tested the program using quicksort for non-replacement sorting and the priority queue for replacement sorting.
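
To make steps 1 through 3 concrete, here is a minimal in-memory sketch of the control flow. It is not the lab solution: vectors of integer keys stand in for the scratch files (the lab itself asks for real files, the Fields package, and a fixed-size array for the run buffer), and the names Run, File, Bank, distribute, mergeRuns, and sortMerge are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Run  = std::vector<int>;   // one sorted run of keys
using File = std::vector<Run>;   // stand-in for a scratch file: a sequence of runs
using Bank = std::vector<File>;  // stand-in for a bank of P scratch files

// Step 1: cut the input into sorted runs and deal them round-robin onto P files.
Bank distribute(std::vector<int> keys, std::size_t runSize, std::size_t P) {
    Bank bank(P);
    std::size_t runIndex = 0;
    for (std::size_t start = 0; start < keys.size(); start += runSize, ++runIndex) {
        std::size_t end = std::min(start + runSize, keys.size());
        std::sort(keys.begin() + start, keys.begin() + end);
        bank[runIndex % P].emplace_back(keys.begin() + start, keys.begin() + end);
    }
    return bank;
}

// Step 2: merge one set of runs (at most one per input file) into a new run.
Run mergeRuns(const std::vector<const Run*>& runs) {
    std::vector<std::size_t> pos(runs.size(), 0);  // read position in each run
    Run merged;
    while (true) {
        std::size_t best = runs.size();            // sentinel: nothing chosen yet
        for (std::size_t i = 0; i < runs.size(); ++i) {
            if (pos[i] == runs[i]->size()) continue;   // run i is exhausted
            if (best == runs.size() || (*runs[i])[pos[i]] < (*runs[best])[pos[best]])
                best = i;
        }
        if (best == runs.size()) break;            // every run is exhausted
        merged.push_back((*runs[best])[pos[best]++]);
    }
    return merged;
}

// Step 3: repeat merge phases, swapping the banks, until one run remains.
// Returns the number of merge phases; the sorted run ends up in in[0][0].
int sortMerge(Bank& in, std::size_t P) {
    int phases = 0;
    while (true) {
        std::size_t total = 0, longest = 0;
        for (const File& f : in) {
            total += f.size();
            longest = std::max(longest, f.size());
        }
        if (total <= 1) return phases;
        Bank out(P);
        for (std::size_t j = 0; j < longest; ++j) {
            std::vector<const Run*> group;         // the j-th run of each input file
            for (const File& f : in)
                if (j < f.size()) group.push_back(&f[j]);
            out[j % P].push_back(mergeRuns(group));
        }
        std::swap(in, out);                        // outputs become the next inputs
        ++phases;
    }
}
```

The sketch assumes P and runSize are at least 1. In the real program, the swap corresponds to closing both banks and reopening the old output files for reading and the old input files for writing.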

In addition to writing the program, you will need to make some decisions about the use of data structures. There are three principal data structures you will need:

  1. A data structure for creating the initial runs. This data structure should be either an array, if you use internal sorting, or the C++ STL priority queue, if you use replacement selection. Do not use a C++ vector instead of an array: because you know the exact size of the array, a fixed-size array is a better fit than a vector, which is meant to handle a dynamically expanding array.

  2. A data structure for managing the two scratch file banks. Since the number of files is variable and depends on P, you will need to dynamically allocate an array to hold each of your file banks. One of the arrays will be a set of file pointers to the output scratch files and one of the arrays will be a set of Fields pointers to the input scratch files. Since you don't know the value of P in advance, you will need to generate the names of your scratch files dynamically. I suggest choosing a base string, like "scratch", and then appending a number to it to generate your names (e.g., "scratch1", "scratch2", etc).

  3. A data structure for managing a line you read in. A class that contains an integer field for the key, a string field for the line, and a boolean flag for replacement selection that indicates whether the line belongs in the current or next run should suffice.
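
The record class and the two-part priority queue key might look like the following sketch. It is one possible arrangement, not the required one: Record, PopsLater, and makeRuns are hypothetical names, the runs are returned as in-memory vectors of keys instead of being written to scratch files, and the line field is left empty for brevity. When only next-run records remain, the sketch simply rebuilds the queue with the flags cleared, which is easy to reason about but not the only way to start a new run.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// One input line plus the bookkeeping needed for replacement selection.
struct Record {
    int key;           // first integer field of the line
    std::string line;  // the full text of the line (unused in this sketch)
    bool nextRun;      // true if the record must wait for the next run
};

// std::priority_queue pops its "largest" element, so this comparison
// returns true when a should be removed AFTER b.
struct PopsLater {
    bool operator()(const Record& a, const Record& b) const {
        if (a.nextRun != b.nextRun) return a.nextRun;  // next-run records wait
        return a.key > b.key;                          // smaller keys leave first
    }
};

// Build runs from keys with a priority queue holding at most runSize records.
// Assumes runSize >= 1.
std::vector<std::vector<int>> makeRuns(const std::vector<int>& keys, std::size_t runSize) {
    std::priority_queue<Record, std::vector<Record>, PopsLater> pq;
    std::size_t next = 0;
    while (next < keys.size() && pq.size() < runSize)
        pq.push(Record{keys[next++], "", false});

    std::vector<std::vector<int>> runs;
    std::vector<int> current;
    while (!pq.empty()) {
        if (pq.top().nextRun) {
            // Only next-run records remain: close this run and start a new one.
            runs.push_back(current);
            current.clear();
            std::vector<Record> held;
            while (!pq.empty()) { held.push_back(pq.top()); pq.pop(); }
            for (Record& r : held) { r.nextRun = false; pq.push(r); }
        }
        Record out = pq.top();
        pq.pop();
        current.push_back(out.key);
        if (next < keys.size()) {
            int k = keys[next++];
            // A key smaller than the one just written cannot join this run.
            pq.push(Record{k, "", k < out.key});
        }
    }
    if (!current.empty()) runs.push_back(current);
    return runs;
}
```

With a queue size of 3, the input 5 1 9 2 8 3 7 yields the runs 1 2 5 8 9 and 3 7, so the first run is longer than the queue itself.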


Experiments

Once you have completed sortmerge, you will perform a series of experiments that vary the initial run size and P parameters. You should use initial run sizes of 10, 100, and 1000 and values of P equal to 2, 4, 8, and 16. For each experiment you should sort the file data found in lab6/data. You should record 1) the aggregate number of file reads and writes (i.e., the sum of the number of calls you make to read a line from a file and the number of calls you make to write a line to a file) and 2) the number of passes your program requires to sort the input. The number of passes should be equal to the number of merge phases you perform plus one for the pass required to sort the initial set of runs. The results should be recorded in a table that looks as follows:

       
        \  P       2          4            8         16
Run Size \
       
10             i/o, pp    i/o, pp      i/o, pp    i/o, pp

100            i/o, pp    i/o, pp      i/o, pp    i/o, pp

1000           i/o, pp    i/o, pp      i/o, pp    i/o, pp

You will need one table for the experiments you run with quicksort and a second table for the experiments you run with replacement selection. You should also produce three graphs of the results, one for each of the three run sizes. The x-axis should be P and the y-axis should be the number of reads/writes. Each graph should have two curves, one for sort-merge using quicksort and one for sort-merge using replacement selection. On the x-axis, you should evenly space 2, 4, 8, and 16. You should notice that replacement selection consistently requires one fewer pass than quicksort and saves an aggregate number of reads and writes that is roughly equal to two times the number of lines in the file.

i/o stands for the aggregate number of reads and writes and pp stands for the number of passes. Even if you cannot get sortmerge to work, you can use the sortmerge in the lab6 directory to conduct these experiments.
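
As a sanity check on the quicksort table, the expected values can be estimated before running anything. The sketch below assumes fixed-size initial runs (so it does not model replacement selection, whose runs are longer), assumes P is at least 2 and the run size at least 1, and relies on the fact that every pass reads each line once and writes each line once. Estimate and estimate are hypothetical names, and the single-run case may behave differently in an actual program, which still has to get the data into the output file.

```cpp
#include <cstddef>

struct Estimate {
    std::size_t passes;  // one sorting pass plus the merge phases
    std::size_t io;      // aggregate reads + writes
};

// lines: number of lines in the input file.
Estimate estimate(std::size_t lines, std::size_t runSize, std::size_t P) {
    std::size_t runs = (lines + runSize - 1) / runSize;  // ceil(lines / runSize)
    std::size_t passes = 1;                              // the initial sorting pass
    while (runs > 1) {
        runs = (runs + P - 1) / P;                       // each phase merges P runs into one
        ++passes;
    }
    return Estimate{passes, 2 * lines * passes};
}
```

For a hypothetical 1000-line file with a run size of 10, this gives 8 passes for P = 2 and 3 passes for P = 16. Removing one pass lowers the estimate by exactly two times the number of lines.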

What to Submit

You should submit the following items:

  1. The tables and graphs described in the previous section. The tables should be placed in a file called answers and submitted with your program files (see item 2 below). The graphs should be handed in to the TAs.

  2. The two data files from the cs302/lab6 directory, plus your make, .h, and .cpp files.