CS302 -- Lab 6
Lab Objective
This lab is designed to give you experience with implementing
and using various sorting methods. Specifically, you will
implement two variations of the sort-merge
algorithm for external sorting--one that uses quicksort
for internal sorting, and one that uses
a priority queue for replacement sorting.
Setting Up
You should copy the following files from the
/home/bvz/cs302/lab6 directory to your directory:
You can use the data files to test your programs.
A working version of sortmerge is also in this directory.
If you have a question
about what your program should do, first see what my program
does. My version of sortmerge has an additional -d flag that
you can use if you want the program to pause after each merge phase so
that you can examine my scratch files and see what they look like.
Problem Statement
In this lab you will implement the balanced, multi-way external sorting
algorithm presented in class. You will also experiment with varying
the size of the initial runs and the size of P to get an idea of how
these parameters affect the performance of the algorithm. Here is a
description of the program you are to write:
Name
sortmerge - sort and collate lines of a file.
Synopsis
sortmerge [ -p num_ways ] [ -r run_size ] [ -q ] input_file output_file
Description
sortmerge sorts the lines in input_file from smallest to
largest and stores the sorted lines in output_file.
You may assume that each line in input_file has three integer
fields. The first field is the sort key. An example data file may be
found in data. If two sort keys tie, it
does not matter in which order the tied lines are placed.
Options
- -p num_ways: The value of P for the P-way sort. If no
value is provided, the value defaults to 3.
- -r run_size: The size of the initial runs. If no
value is provided, the value defaults to 5.
- -q: If the -q option is specified, then sortmerge should
perform replacement selection using a priority queue. Otherwise
sortmerge should use quicksort to sort the runs.
You should also allow the TAs to pause your program after each of your
merge phases so that they can examine your scratch files. You should
pause your program only after you have closed the scratch files from
the previous merge phase because the new output scratch files will
appear empty until you have done so. The following procedure provides
an easy way to put this pause into your program and you should feel free
to copy it into your program:
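#include <cstdio>

void waitForReturnKey() {
    int okToProceedChar;  // int, not char, so that EOF can be told apart from real input
    printf("hit 'return' to continue.\n");
    while (true) {
        okToProceedChar = getchar();
        // stop on return, or on end of input so the loop cannot spin forever
        if (okToProceedChar == '\n' || okToProceedChar == EOF) break;
    }
}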
Examples
- Sort the contents of data and store the sorted file in
output-data:
sortmerge data output-data
- Sort the contents of data using a 10-way sort and store the
sorted file in output-data:
sortmerge -p 10 data output-data
- Sort the contents of data using a 4-way sort and an initial
run size of 6. Store the sorted file in output-data:
sortmerge -r 6 -p 4 data output-data
- Sort the contents of data using a 4-way sort and an initial
run size of 6. Use replacement selection and store the sorted file in
output-data:
sortmerge -r 6 -p 4 -q data output-data
Output
The output of sortmerge is the sorted output file and a printed
list of three statistics:
- Number of Reads: The total number of lines that you read during the
sort and merge phases. This number should include the number of lines
that you read from the input file plus the number of lines that you read
from any of the scratch files.
- Number of Writes: The total number of lines that you wrote during the
sort and merge phases. This number should include the number of lines
that you wrote to the output file plus the number of lines that you wrote
to any of the scratch files. This number
should be the same as the number of reads.
- Number of Passes: The number of passes your program requires
to sort the input. The number of passes should be equal to the number
of merge phases you perform plus one for the pass required to sort
the initial set of runs.
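As a sanity check on these statistics, the expected values can be estimated ahead of time. The sketch below assumes the non-replacement case (fixed-size initial runs) and assumes that every merge phase reduces the number of runs by a factor of P. The function names are hypothetical and are not part of the provided lab code, and the reference sortmerge may treat edge cases (such as an input that fits in a single run) differently.

```cpp
#include <cassert>

// Hypothetical helper, not part of the lab's provided code.
// Estimates the pass count for a balanced P-way sort-merge that builds
// fixed-size initial runs (the non-replacement case).
long expectedPasses(long numLines, long runSize, long p) {
    assert(numLines >= 0 && runSize >= 1 && p >= 2);
    long runs = (numLines + runSize - 1) / runSize;  // ceil(numLines / runSize)
    long passes = 1;                                 // the pass that creates the initial runs
    while (runs > 1) {                               // one merge phase per iteration
        runs = (runs + p - 1) / p;                   // ceil(runs / p) runs remain
        ++passes;
    }
    return passes;
}

// Assumes every pass reads and writes each line exactly once,
// so the same estimate applies to the number of writes.
long expectedReads(long numLines, long runSize, long p) {
    return numLines * expectedPasses(numLines, runSize, p);
}
```

For example, 100 lines with the default run size of 5 and P of 3 give 20 initial runs, which shrink to 7, 3, and then 1 run, for 4 passes in total.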
Error Checking
The only error checking that you need to do for this lab is to check
the command line arguments and ensure that: 1) any flags are valid
flags, 2) any numeric parameters are non-negative, and 3) both an
input and output file are specified.
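One possible way to perform these three checks is sketched below. The Options struct and the parseArgs and parseCount names are hypothetical, and the sketch assumes that all flags come before the two file names, as they do in the synopsis and the examples.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical container for the parsed command line.
struct Options {
    long numWays = 3;          // -p, default taken from the Options section
    long runSize = 5;          // -r, default taken from the Options section
    bool replacement = false;  // -q
    std::string inputFile;
    std::string outputFile;
};

// Parses a non-negative integer; returns false on junk or a negative value.
static bool parseCount(const char* text, long& out) {
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0' || value < 0) return false;
    out = value;
    return true;
}

// Returns true only if every flag is known, every numeric parameter is
// non-negative, and exactly two file names follow the flags.
bool parseArgs(int argc, char* argv[], Options& opts) {
    int i = 1;
    for (; i < argc && argv[i][0] == '-'; ++i) {
        std::string flag = argv[i];
        if (flag == "-q") {
            opts.replacement = true;
        } else if (flag == "-p" || flag == "-r") {
            if (i + 1 >= argc) return false;  // flag is missing its value
            long& target = (flag == "-p") ? opts.numWays : opts.runSize;
            if (!parseCount(argv[++i], target)) return false;
        } else {
            return false;  // unknown flag
        }
    }
    if (argc - i != 2) return false;  // need an input and an output file
    opts.inputFile = argv[i];
    opts.outputFile = argv[i + 1];
    return true;
}
```

A main function would call parseArgs(argc, argv, opts) first and print a usage message and exit if it returns false.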
Strategy
Write this program using the pseudo-code provided in class and the
various classes that we have provided for you.
I created the sortmerge program incrementally using the following
steps:
- Write and test an input routine for reading data from the input
file and distributing it onto P scratch output files using
non-replacement sorting. This
input routine repeatedly reads data into an array whose size equals the run size,
sorts the data using the C++
sort function (typically implemented as a quicksort variant), and then
outputs the run to the appropriate output file. You should use the
Fields package to read in a file and fopen to write out a
file.
- Write and test a routine for merging one set of runs from the scratch
input files into a new run and writing the new run to a scratch
output file.
- Write and test the sort merge procedure. It will have to call the
routine written in the previous step to create runs and it will
have to manage the scratch files so that on each iteration, the
previous input files become output files and the previous output
files become input files.
- Write and test an input routine for reading data from the input
file and distributing it onto P scratch output files using
replacement sorting. This
input routine repeatedly reads data into a priority queue whose
size equals the run size and then creates a run by removing records
from the priority queue and outputting them to the appropriate output
scratch file. Since a record's position in the priority queue
depends on both the value of
its key and whether it belongs in the current or the next run, the
"key" for the priority queue
is actually a combination of two elements. Therefore
you will need to write your own comparison function that compares two
records. The records that you store into the priority queue
will need to have
an additional field that indicates whether or not the record belongs
in the current run or the next run.
- Write and test the main procedure so that it handles the -r,
-p, and -q switches.
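The replacement-sorting step above, with its two-part comparison, can be sketched as follows. This is only an illustration under stated assumptions: the Record, RecordAfter, and makeRuns names are hypothetical, a std::vector stands in for the input file and the scratch files (the real program would use the Fields package and fopen), and the flags are cleared by rebuilding the queue at the start of each run, which is one of several ways to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: sort key, the original line, and the run flag.
struct Record {
    int key;
    std::string line;
    bool nextRun;  // false: belongs to the current run; true: deferred to the next run
};

// std::priority_queue pops the "largest" element, so this comparator returns
// true when a should come out AFTER b: next-run records after current-run
// records, and larger keys after smaller keys.
struct RecordAfter {
    bool operator()(const Record& a, const Record& b) const {
        if (a.nextRun != b.nextRun) return a.nextRun;
        return a.key > b.key;
    }
};

using RecordQueue = std::priority_queue<Record, std::vector<Record>, RecordAfter>;

// Builds runs from `input` using a queue that never holds more than
// `runSize` records.
std::vector<std::vector<Record>> makeRuns(const std::vector<Record>& input,
                                          std::size_t runSize) {
    assert(runSize >= 1);
    std::vector<std::vector<Record>> runs;
    RecordQueue pq;
    std::size_t next = 0;
    while (next < input.size() && pq.size() < runSize) {  // prime the queue
        Record r = input[next++];
        r.nextRun = false;
        pq.push(r);
    }
    while (!pq.empty()) {
        if (runs.empty() || pq.top().nextRun) {
            // Either this is the first run or only next-run records remain:
            // start a new run and clear every flag.
            RecordQueue fresh;
            while (!pq.empty()) {
                Record r = pq.top();
                pq.pop();
                r.nextRun = false;
                fresh.push(r);
            }
            pq = std::move(fresh);
            runs.emplace_back();
        }
        Record out = pq.top();
        pq.pop();
        runs.back().push_back(out);
        if (next < input.size()) {
            // A new record can join the current run only if its key is not
            // smaller than the key that was just written.
            Record in = input[next++];
            in.nextRun = in.key < out.key;
            pq.push(in);
        }
    }
    return runs;
}
```

With a run size of 3 and the keys 81, 94, 11, 96, 12, 35, 17, 99, 28, 58, 41, 75, 15, this sketch produces the runs 11 81 94 96, then 12 17 28 35 41 58 75 99, then 15. The middle run is longer than the queue size, which is the point of replacement sorting.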
In the initial phases of writing the
program, I used the default values for the initial run sizes and
the value of P. I alternately tested the program using
quicksort for non-replacement sorting and the priority queue for
replacement sorting.
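The merge step in the list above (combining one run from each scratch input file into a single new run) can be sketched as follows. The mergeRuns name is hypothetical, each std::vector<int> stands in for the sort keys of one run on a scratch file, and the linear scan for the smallest head is simply the easiest choice to read; it is not necessarily what the class pseudo-code does.

```cpp
#include <cstddef>
#include <vector>

// Merges one sorted run from each input into a single sorted run.
std::vector<int> mergeRuns(const std::vector<std::vector<int>>& runs) {
    std::vector<std::size_t> pos(runs.size(), 0);  // read position within each run
    std::vector<int> merged;
    while (true) {
        bool found = false;
        std::size_t best = 0;
        for (std::size_t i = 0; i < runs.size(); ++i) {
            if (pos[i] == runs[i].size()) continue;  // this run is exhausted
            if (!found || runs[i][pos[i]] < runs[best][pos[best]]) {
                best = i;  // smallest head seen so far
                found = true;
            }
        }
        if (!found) break;  // every run is exhausted
        merged.push_back(runs[best][pos[best]++]);
    }
    return merged;
}
```

Scanning all P heads costs O(P) per output record. For the values of P used in this lab that is cheap, though a priority queue over the heads would also work.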
In addition to writing the program, you will need to make some decisions
about the use of data structures. There are three principal data structures
you will need:
- A data structure for creating the initial runs. This data structure
should be either an array, if you use internal sorting, or
the C++ STL priority queue, if you use replacement selection. Do not use a C++ vector instead of an
array. Because you know the exact size of the array it is better to
use a fixed size array rather than a vector, which is meant to
handle a dynamically expanding array.
- A data structure for managing the two scratch file banks.
Since the number of files is variable and depends on P, you
will need to dynamically allocate an array to
hold each of your file banks. One of the arrays will be a set
of file pointers to the output scratch files and one of the
arrays will be a set of Fields pointers to the input scratch files.
Since you don't know the value of P in advance, you will need
to dynamically name your scratch files. I suggest choosing a base
string, like "scratch", and then appending a number to it to generate
your names (e.g., "scratch1", "scratch2", etc).
- A data structure for managing a line you read in. A class that
contains an integer field for the key, a string field for
the line, and a boolean flag for replacement selection
that indicates whether the line belongs in the current or next run
should suffice.
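The scratch-file naming scheme suggested above can be sketched as follows. The makeBankNames helper and the numbering of the second bank (continuing from P+1) are assumptions made for illustration. The names are held in a dynamically allocated array, owned here by a std::unique_ptr so that it is freed automatically.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical helper: builds the names for one bank of p scratch files,
// numbered first, first + 1, ..., first + p - 1.
std::unique_ptr<std::string[]> makeBankNames(const std::string& base, int first, int p) {
    assert(p >= 1);
    std::unique_ptr<std::string[]> names(new std::string[p]);
    for (int i = 0; i < p; ++i) {
        names[i] = base + std::to_string(first + i);
    }
    return names;
}
```

With P equal to 3, makeBankNames("scratch", 1, 3) yields scratch1 through scratch3 and makeBankNames("scratch", 4, 3) yields scratch4 through scratch6. After each merge phase, calling std::swap on the two name arrays exchanges the roles of the input and output banks without copying any strings.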
Experiments
Once you have completed sortmerge, you will perform a series of
experiments
that vary the initial run size and P parameters. You should use
initial run sizes of 10, 100, and 1000 and values of P equal to
2, 4, 8, and 16. For each experiment you should sort the file data
found in lab6/data. You should
record 1) the aggregate number of file reads and writes (i.e., the
sum of the number of calls you make to read a line from a file and the number
of calls you make to write a line to a file)
and 2) the
number of passes your program requires
to sort the input. The number of passes should be equal to the number
of merge phases you perform plus one for the pass required to sort
the initial set of runs.
The results should be recorded in a table that
looks as follows:
Run Size \ P |    2    |    4    |    8    |   16
-------------+---------+---------+---------+---------
10           | i/o, pp | i/o, pp | i/o, pp | i/o, pp
100          | i/o, pp | i/o, pp | i/o, pp | i/o, pp
1000         | i/o, pp | i/o, pp | i/o, pp | i/o, pp
You will need one table for the experiments you run with quicksort and
a second table for the experiments you run with replacement selection.
You should also produce three graphs of the results, one for each of the
three run sizes. The x-axis should be
P and the y-axis should be the number of reads/writes.
Each graph should have two curves, one for
sort-merge using quicksort and one for sort-merge using replacement selection.
On the x-axis, you should evenly space 2, 4, 8, and 16. You should notice
that replacement selection consistently requires one fewer pass than
quicksort and saves an aggregate number of reads and writes that is roughly
equal to two times the number of lines in the file.
i/o stands for the aggregate number of reads and writes
and pp stands for the number of
passes.
Even if you cannot get sortmerge to work, you can use the sortmerge in the
lab6 directory to conduct these experiments.
What to Submit
You should submit the following items:
- The tables and graphs described in the previous section. The tables
should be placed in a file called answers and submitted with
your program files (see item 2 below). The graphs should be handed
in to the TAs.
- The two data files from the cs302/lab6 directory, plus your makefile,
.h, and .cpp files.