CS302 --- External Sorting

Brad Vander Zanden


Overview

External Sorting--This term refers to sorting methods used when the data to be sorted is too large to fit in primary memory.

Characteristics of External Sorting

  1. During the sort, some of the data must be stored externally. Typically the data will be stored on tape or disk.
  2. The cost of accessing data is significantly greater than either bookkeeping or comparison costs.
  3. There may be severe restrictions on access. For example, if tape is used, items must be accessed sequentially.

Criteria for Developing an External Sorting Algorithm

  1. Minimize the number of times an item is accessed.
  2. Access items in sequential order.

Important Uses of External Sorting

  1. Business applications where a "transaction" file updates a master file.
    	    Example: Updating an inventory database based on sales
    		     Updating a personnel database based on new hires,
    			promotions, dismissals, etc.
    
  2. Database applications
    		Example: Suppose one database contains information about
    		    courses and rooms and another database contains 
    		    information about students and courses. To find out
    		    which classrooms are being used by CS students, one
    		    could write the query:
    
    			Select from student database records with
    			    student.major = CS
    			Join the records selected in the previous query
    			    with the courses database using the common
    			    courses field.
    			Project the result on rooms to produce a listing
    			    of the rooms being used by CS students
    

Merge Sort--A Digression

Merge sort is an ideal candidate for external sorting because it satisfies the two criteria for developing an external sorting algorithm. Merge sort can be implemented either top-down or bottom-up. The top-down strategy is typically used for internal sorting, whereas the bottom-up strategy is typically used for external sorting.

Top-Down Strategy

The top-down strategy works by:
  1. Dividing the data in half
  2. Sorting each half
  3. Merging the two halves

Example

Try using this strategy on the following set of characters (ignore the blanks):
		a sorting example 

Code

This section presents code for implementing top-down mergesort using arrays.

Merge code for arrays

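		/* a[1..M] and b[1..N] are sorted; c[1..M+N] receives the merge. */
		/* a and b each need one extra slot for the INT_MAX sentinel,    */
		/* and the data itself must not contain INT_MAX.                 */
		i = 1;
		j = 1;
		a[M+1] = INT_MAX;
		b[N+1] = INT_MAX;
		for (k = 1; k <= M+N; k++)
		    c[k] = (a[i] <= b[j]) ? a[i++] : b[j++];	/* <= keeps ties stable */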
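
The fragment above depends on 1-based arrays and on sentinel slots at a[M+1] and b[N+1]. As a minimal, self-contained sketch of the same merge step, the version below assumes 0-based arrays and replaces the sentinels with explicit bounds checks. The name merge_runs is hypothetical and is used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Merge sorted a[0..m-1] and sorted b[0..n-1] into c[0..m+n-1].
 * Hypothetical helper for illustration: 0-based, no sentinels. */
void merge_runs(const int *a, size_t m, const int *b, size_t n, int *c)
{
    size_t i = 0, j = 0, k = 0;

    assert(c != NULL);
    /* Take the smaller head element; <= prefers a on ties (stable). */
    while (i < m && j < n)
        c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    /* One run is exhausted; copy whatever remains of the other. */
    while (i < m)
        c[k++] = a[i++];
    while (j < n)
        c[k++] = b[j++];
}
```

Because no sentinel is stored, this variant also works when the data contains INT_MAX, at the cost of two extra index tests per step.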

Array Mergesort

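		#include <limits.h>	/* INT_MAX */

		/* b and c are scratch arrays assumed to be declared elsewhere,
		   each with room for half the data (rounded up) plus one sentinel. */
		void mergesort(int a[], int left, int right)
		{
		    int i, j, k, mid;

		    if (right > left) {
			mid = (right + left) / 2;
			mergesort(a, left, mid);
			mergesort(a, mid+1, right);
			/* copy the first run into array b */
			for (i = left, j = 0; i <= mid; i++, j++)
			    b[j] = a[i];
			b[j] = INT_MAX;		/* sentinel */
			/* copy the second run into array c */
			for (i = mid+1, j = 0; i <= right; i++, j++)
			    c[j] = a[i];
			c[j] = INT_MAX;		/* sentinel */
			/* merge the two runs; <= takes from b on ties, keeping the sort stable */
			i = 0;
			j = 0;
			for (k = left; k <= right; k++)
			    a[k] = (b[i] <= c[j]) ? b[i++] : c[j++];
		    }
		}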
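
To try the top-down strategy on the example characters, here is a minimal self-contained sketch. It assumes a single scratch buffer allocated by a wrapper in place of the arrays b and c, and it uses bounds checks instead of sentinels. The names sort_range and mergesort_chars are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sort a[left..right] top-down; aux is scratch space at least as long as a. */
static void sort_range(char *a, char *aux, int left, int right)
{
    int i, j, k, mid;

    if (right <= left)
        return;                          /* zero or one item: already sorted */
    mid = left + (right - left) / 2;     /* 1. divide the data in half       */
    sort_range(a, aux, left, mid);       /* 2. sort each half                */
    sort_range(a, aux, mid + 1, right);
    memcpy(aux + left, a + left, (size_t)(right - left + 1));
    i = left;                            /* 3. merge the two halves          */
    j = mid + 1;
    for (k = left; k <= right; k++) {
        if (i > mid)                a[k] = aux[j++];   /* first half used up  */
        else if (j > right)         a[k] = aux[i++];   /* second half used up */
        else if (aux[i] <= aux[j])  a[k] = aux[i++];   /* ties: first half    */
        else                        a[k] = aux[j++];
    }
}

/* Hypothetical wrapper: sorts the n characters of s in place.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int mergesort_chars(char *s, int n)
{
    char *aux;

    assert(s != NULL && n >= 0);
    if (n < 2)
        return 0;
    aux = malloc((size_t)n);
    if (aux == NULL)
        return -1;
    sort_range(s, aux, 0, n - 1);
    free(aux);
    return 0;
}
```

Calling mergesort_chars on the 15 characters of "asortingexample" (the example with the blanks removed) leaves the buffer holding "aaeegilmnoprstx".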

Bottom-Up Strategy

We will use the bottom-up strategy for external sorting. The bottom-up strategy for mergesort works by:
  1. Scanning through the data, performing 1-by-1 merges to get sorted lists of size 2.
  2. Scanning through the size 2 sublists, performing 2-by-2 merges to get sorted lists of size 4.
  3. Continuing to scan through size n sublists, performing n-by-n merges to get sorted lists of size 2n, until the file is sorted (i.e., 2n >= N, where N is the size of the file).

Example

Try implementing this strategy on the following set of characters (ignore the blanks):
		a sorting example 

You get the following runs:

		as, or, it, gn, ex, am, lp, e
		aors, gint, aemx, elp
		aginorst, aeelmpx
		aaeegilmnoprstx

Code for Bottom-Up Strategy

The code for doing an in-memory sort using the bottom-up strategy is much more complicated than for the top-down strategy. In general, one is better off using the top-down strategy if one wants to use mergesort to perform an internal memory sort.
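
For reference, here is one possible in-memory sketch of the bottom-up strategy. It is not the course's bottom_up_mergesort.c. It assumes 0-based arrays and a scratch array of the same size as the input, and the names merge_pair and mergesort_bottom_up are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merge sorted src[lo..mid-1] and src[mid..hi-1] into dst[lo..hi-1]. */
static void merge_pair(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid)
        dst[k++] = src[i++];
    while (j < hi)
        dst[k++] = src[j++];
}

/* Hypothetical name. Sorts a[0..n-1] in place.
 * Returns 0 on success, -1 if the scratch array cannot be allocated. */
int mergesort_bottom_up(int *a, size_t n)
{
    int *aux;
    size_t width, lo, mid, hi;

    assert(a != NULL || n == 0);
    if (n < 2)
        return 0;
    aux = malloc(n * sizeof *aux);
    if (aux == NULL)
        return -1;
    /* width is the current run length: 1, 2, 4, ... */
    for (width = 1; width < n; width *= 2) {
        /* One sequential pass: merge neighbouring runs pairwise. */
        for (lo = 0; lo < n; lo += 2 * width) {
            mid = (lo + width < n) ? lo + width : n;
            hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_pair(a, aux, lo, mid, hi);   /* a short last run is just copied */
        }
        memcpy(a, aux, n * sizeof *a);   /* this pass's output feeds the next */
    }
    free(aux);
    return 0;
}
```

Each iteration of the outer loop corresponds to one line of runs in the example above: every pass reads the data strictly in order, which is the property that makes the strategy suitable for tape or disk.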

Mergesort Performance

  1. Mergesort requires about N lg N comparisons to sort any file of N elements.
  2. Mergesort requires extra space proportional to N.
  3. Mergesort is stable.
  4. Mergesort is insensitive to the initial order of its input. In other words, it will require roughly N lg N comparisons on any input.
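
Points 1 and 4 can be checked empirically. The sketch below assumes the sentinel-based array mergesort shown earlier, instrumented to count key comparisons. The name counting_mergesort, the scratch arrays run_b and run_c, and the capacity MAX_ITEMS are all hypothetical.

```c
#include <assert.h>
#include <limits.h>

enum { MAX_ITEMS = 1024 };            /* hypothetical capacity limit */

static int run_b[MAX_ITEMS / 2 + 2];  /* scratch for the first run + sentinel  */
static int run_c[MAX_ITEMS / 2 + 2];  /* scratch for the second run + sentinel */

/* Sorts a[left..right] and returns the number of key comparisons made.
 * The keys must be smaller than INT_MAX, which serves as the sentinel. */
long counting_mergesort(int a[], int left, int right)
{
    int i, j, k, mid;
    long count = 0;

    assert(right - left + 1 <= MAX_ITEMS);
    if (right <= left)
        return 0;
    mid = left + (right - left) / 2;
    count += counting_mergesort(a, left, mid);
    count += counting_mergesort(a, mid + 1, right);
    for (i = left, j = 0; i <= mid; i++, j++)
        run_b[j] = a[i];
    run_b[j] = INT_MAX;
    for (i = mid + 1, j = 0; i <= right; i++, j++)
        run_c[j] = a[i];
    run_c[j] = INT_MAX;
    i = j = 0;
    for (k = left; k <= right; k++) {
        count++;                      /* exactly one comparison per output item */
        a[k] = (run_b[i] <= run_c[j]) ? run_b[i++] : run_c[j++];
    }
    return count;
}
```

With the sentinel merge, every merge of two runs totalling n items makes exactly n comparisons, so the count depends only on N. For N = 16 the function returns 64 = 16 lg 16 whether the input is already sorted or in reverse order.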

Programs

The source code for the array, list, and bottom-up mergesorts is contained in the directory /sunshine/bvz/courses/302/src/ under the names array_mergesort.c, list_mergesort.c, and bottom_up_mergesort.c respectively. Binary files with the same names (with the .c omitted) can be found in /sunshine/bvz/courses/302/bin/. Each program takes one argument, the name of a file to sort, and outputs the sorted data. The programs do not write to the input file, so the file itself remains unsorted. For example, one could type:
		array_mergesort my_file
Test data can be generated using the program generate_mergesort_data, which is found in the bin directory mentioned above. generate_mergesort_data takes the number of integers to generate as an argument and outputs that many randomly generated integers. For example, one could type:
		generate_mergesort_data 10 > my_file
		array_mergesort my_file