CS302 --- External Sorting
Overview
External Sorting--This term is used to refer to sorting methods
that are employed when the data to be sorted is too large
to fit in primary memory.
Characteristics of External Sorting
- During the sort, some of the data must be stored externally.
Typically the data will be stored on tape or disk.
- The cost of accessing data is significantly greater than either
bookkeeping or comparison costs.
- There may be severe restrictions on access. For example, if
tape is used, items must be accessed sequentially.
Criteria for Developing an External Sorting Algorithm
- Minimize the number of times an item is accessed.
- Access items in sequential order.
Important Uses of External Sorting
- Business applications where a "transaction" file updates
a master file.
Examples: Updating an inventory database based on sales;
updating a personnel database based on new hires,
promotions, dismissals, etc.
- Database applications
- Projection: The user requests a subset of the fields in
a file. When a subset of the fields is taken, there might
be duplicate records, so an external sort is used to
remove duplicates.
- Join: Two files are "joined" on a common field(s) to create
a new file whose fields are the union of the fields of the
two files. The two files must be sorted so that the "join"
fields can be matched.
Example: Suppose one database contains information about
courses and rooms and another database contains
information about students and courses. To find out
which classrooms are being used by CS students, one
could write the query:
Select from student database records with
student.major = CS
Join the records selected in the previous query
with the courses database using the common
courses field.
Project the result on rooms to produce a listing
of the rooms being used by CS students
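The reason both files must be sorted on the join field is that the matching can then be done in a single sequential pass over each file, exactly like a merge. Here is a minimal in-memory sketch of that pass. The record layouts (struct enrollment, struct course_room), the function name merge_join, and the assumption that each course appears only once in the courses file are all invented for this illustration; a real join would read the records from disk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layouts, invented for this sketch. */
struct enrollment  { int course_id; int student_id; };
struct course_room { int course_id; int room; };

/*
 * Join two arrays that are BOTH sorted by course_id.
 * For every enrollment whose course appears in crs[], the course's
 * room is written to out[]. Returns the number of rooms written.
 * Assumes course_id is unique within crs[].
 */
size_t merge_join(const struct enrollment *enr, size_t n_enr,
                  const struct course_room *crs, size_t n_crs,
                  int *out, size_t out_cap)
{
    size_t i = 0, j = 0, n_out = 0;

    assert(enr != NULL && crs != NULL && out != NULL);

    while (i < n_enr && j < n_crs && n_out < out_cap) {
        if (enr[i].course_id < crs[j].course_id) {
            i++;                        /* enrollment has no matching course */
        } else if (enr[i].course_id > crs[j].course_id) {
            j++;                        /* course has no (more) enrollments */
        } else {
            out[n_out++] = crs[j].room; /* join fields match */
            i++;                        /* next enrollment may match the same course */
        }
    }
    return n_out;
}
```

Each file is scanned once, front to back, which is the access pattern external storage favors. Note that the output can list the same room several times (once per enrolled student), which is why the final projection step needs duplicate removal.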
Merge Sort--A Digression
Merge sort is an ideal candidate for external sorting because it satisfies
the two criteria for developing an external sorting algorithm. Merge sort
can be implemented either top-down or bottom-up. The top-down strategy is
typically used for internal sorting, whereas the bottom-up strategy is
typically used for external sorting.
Top-Down Strategy
The top-down strategy works by:
- Dividing the data in half
- Sorting each half
- Merging the two halves
Example
Try using this strategy on the following set of characters
(ignore the blanks):
a sorting example
Code
This section presents code for implementing top-down
mergesort using arrays.
Merge code for arrays
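/* Merge sorted a[1..M] and b[1..N] into c[1..M+N].
   The sentinels require every key to be less than INT_MAX (<limits.h>). */
i = 1;
j = 1;
a[M+1] = INT_MAX;
b[N+1] = INT_MAX;
for (k = 1; k <= M+N; k++)
    c[k] = (a[i] <= b[j]) ? a[i++] : b[j++];  /* <= takes ties from a: stable */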
Array Mergesort
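/* b[] and c[] are global scratch arrays, each large enough to hold
   half of the data plus one sentinel. INT_MAX comes from <limits.h>;
   every key must be less than INT_MAX for the sentinels to work. */
void mergesort(int a[], int left, int right)
{
    int i, j, k, mid;

    if (right > left) {
        mid = (right + left) / 2;
        mergesort(a, left, mid);
        mergesort(a, mid+1, right);
        /* copy the first run into array b */
        for (i = left, j = 0; i <= mid; i++, j++)
            b[j] = a[i];
        b[j] = INT_MAX;
        /* copy the second run into array c */
        for (i = mid+1, j = 0; i <= right; i++, j++)
            c[j] = a[i];
        c[j] = INT_MAX;
        /* merge the two runs; <= takes ties from the first run (stable) */
        i = 0;
        j = 0;
        for (k = left; k <= right; k++)
            a[k] = (b[i] <= c[j]) ? b[i++] : c[j++];
    }
}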
- This code is a more straightforward but less elegant version
of mergesort than the mergesort routine presented on page 166
of Sedgewick.
- My code incorporates the merging code shown above under
"Merge code for arrays" (adapted to 0-based scratch arrays).
- Sedgewick's code uses a characteristic of the
two halves of the data to cleverly avoid using sentinels.
Right before the merging code begins, he also manages to avoid
the assignment i = left by loading the first half of the array
from back to front (i.e., by decrementing i rather than
incrementing i).
- Which is faster? I coded up both algorithms and ran them
on 100,000 elements. The result was a virtual dead heat
(mine ran in 2.93 seconds versus 2.95 seconds for Sedgewick).
- What is the moral? Sedgewick's code is undeniably more
elegant than mine. However, my code is more straightforward
and could be more easily understood by a programmer trying
to maintain the code. Since the two programs run almost
identically fast, I would prefer my code since it's easier
to understand. The moral here is that unless you can get
a significant performance improvement from clever but
subtle algorithmic tricks, you should use the more
straightforward approach. If you can get significant
performance improvements, then you should clearly document
the tricks you used so that a programmer unfamiliar with
your code can figure out what you were doing.
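For readers without the book at hand, here is a sketch of the kind of sentinel-free merge described above. It is my reconstruction of the idea, not Sedgewick's code verbatim: the function name merge_no_sentinels and the caller-supplied scratch array tmp are assumptions made for this illustration. The first run is copied in order (with i counting down, so it finishes at left), and the second run is copied in reverse order, so the largest key of each run stops the scan of the other.

```c
#include <assert.h>

/*
 * Merge the sorted runs a[left..mid] and a[mid+1..right] without sentinels.
 * tmp[] is scratch space that must be valid for indices left..right.
 */
void merge_no_sentinels(int a[], int tmp[], int left, int mid, int right)
{
    int i, j, k;

    assert(left <= mid && mid <= right);

    /* Copy the first run back to front; the loop ends with i == left. */
    for (i = mid + 1; i > left; i--)
        tmp[i - 1] = a[i - 1];

    /* Copy the second run in reverse order; the loop ends with j == right. */
    for (j = mid; j < right; j++)
        tmp[right + mid - j] = a[j + 1];

    /* tmp[left..right] now rises and then falls. i scans up from the left
       end, j scans down from the right end, and the largest key of each
       run keeps the other index from running off its run. */
    for (k = left; k <= right; k++)
        a[k] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j--];
}
```

Because no sentinel is stored, keys equal to INT_MAX cause no trouble here. The price is subtlety: as written, this merge can reorder equal keys that come from the second run, so it should not be relied on where stability matters.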
Bottom-Up Strategy
The bottom-up strategy is the strategy that we will use for external
sorting.
The bottom-up strategy for mergesort works by:
- Scanning through data performing 1-by-1 merges to get sorted
lists of size 2.
- Scanning through the size 2 sublists and performing 2-by-2 merges
to get sorted lists of size 4.
- Continuing the process of scanning through size n sublists
and performing n-by-n merges to get sorted lists of size 2n
until the file is sorted (i.e., 2n >= N, where N is the
size of the file).
Example
Try implementing this strategy on the following set of characters
(ignore the blanks):
a sorting example
You get the following runs:
as, or, it, gn, ex, am, lp, e
aors, gint, aemx, elp
aginorst, aeelmpx
aaeegilmnoprstx
Code for Bottom-Up Strategy
The code for doing an in-memory sort using the bottom-up strategy is
much more complicated than for the top-down strategy. In general, one
is better off using the top-down strategy if one wants to use mergesort
to perform an internal memory sort.
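For reference, here is one possible in-memory sketch of the bottom-up strategy. It is not the bottom_up_mergesort.c program mentioned in these notes; the names merge_runs and bottom_up_mergesort are made up for this illustration, and it sorts characters rather than integers so that it can be run on the example above. Each pass of the outer loop corresponds to one line of runs in that example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merge sorted runs src[lo..mid-1] and src[mid..hi-1] into dst[lo..hi-1]. */
static void merge_runs(const char *src, char *dst,
                       size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++]; /* ties from first run */
    while (i < mid)
        dst[k++] = src[i++];
    while (j < hi)
        dst[k++] = src[j++];
}

/* Bottom-up mergesort of a[0..n-1]. Returns 0 on success, -1 if malloc fails. */
int bottom_up_mergesort(char *a, size_t n)
{
    char *tmp;
    size_t width, lo;

    assert(a != NULL);
    if (n < 2)
        return 0;
    tmp = malloc(n);
    if (tmp == NULL)
        return -1;

    /* width is the current run size: 1, 2, 4, ... until one run covers the file */
    for (width = 1; width < n; width *= 2) {
        for (lo = 0; lo < n; lo += 2 * width) {
            /* the last run(s) of a pass may be short or missing */
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(a, tmp, lo, mid, hi);
        }
        memcpy(a, tmp, n);  /* merged runs become the input of the next pass */
    }
    free(tmp);
    return 0;
}
```

Notice that every pass reads its input strictly front to back and writes its output strictly front to back. That sequential access pattern, not the in-memory convenience, is what makes this strategy attractive for external sorting.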
Mergesort Performance
- Mergesort requires about N lg N comparisons to sort any file of
N elements.
- Mergesort requires extra space proportional to N.
- Mergesort is stable.
- Mergesort is insensitive to the initial order of its input.
In other words, it will require roughly N lg N comparisons
on any input.
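One way to check the comparison claim yourself is to instrument a mergesort with a counter. The sketch below does that; the counter n_compares and the functions sort_range and count_mergesort_compares are names invented for this illustration, and the merge uses explicit bounds checks instead of sentinels so that only key comparisons are counted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static unsigned long n_compares;  /* global counter, for illustration only */

/* Top-down mergesort of a[lo..hi-1], using tmp[] as scratch space. */
static void sort_range(int *a, int *tmp, size_t lo, size_t hi)
{
    size_t mid, i, j, k;

    if (hi - lo < 2)
        return;
    mid = lo + (hi - lo) / 2;
    sort_range(a, tmp, lo, mid);
    sort_range(a, tmp, mid, hi);

    i = lo; j = mid; k = lo;
    while (i < mid && j < hi) {
        n_compares++;             /* one key comparison per iteration */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    }
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < hi)
        tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Sorts a[0..n-1] (tmp must hold n ints) and returns the comparison count. */
unsigned long count_mergesort_compares(int *a, int *tmp, size_t n)
{
    assert(a != NULL && tmp != NULL);
    n_compares = 0;
    sort_range(a, tmp, 0, n);
    return n_compares;
}
```

With this version, 16 keys that are already in order cost 32 comparisons, which is (N/2) lg N, and no arrangement of 16 keys can cost more than N lg N = 64. So the count does vary a little with the input, but only by a constant factor, which is what "insensitive" means here.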
Programs
The source code for the array, list, and bottom-up mergesorts
is contained in the directory /sunshine/bvz/courses/302/src/ under
the names array_mergesort.c, list_mergesort.c, and
bottom_up_mergesort.c, respectively. Binary files with the same
names (with the .c omitted of course) can be found in
/sunshine/bvz/courses/302/bin/. The programs take one argument, the
name of a file to sort, and output the sorted data (they do not
modify the input file, so it remains unsorted). For example, one
could type:
array_mergesort my_file
Test data can be generated using the program generate_mergesort_data,
which is found in the above-mentioned bin directory.
generate_mergesort_data takes the number of integers to be generated
as an argument and outputs the desired number of integers. The
integers are generated randomly.
For example, one could type:
generate_mergesort_data 10 > my_file
array_mergesort my_file