B-Tree Code

The complete code for inserting into a B-tree and searching a B-tree can be found in /azure/homes/bvz/courses/302/src/btree. In this lecture we will only discuss insertion. However, the search code is very similar to the code we use for finding the record node into which a record should be inserted.

Insertion

Overview

The code that will be presented assumes that we are inserting records with integer keys into a B-tree. The first part of the insertion routine locates the desired record page and the second part inserts the record into this record page. The first part may split index nodes as it proceeds down the tree.

Initialization

Before starting our search of the index, we need to allocate buffers for the parent and node pointers. We need to keep a trailing parent pointer in case we split a node and need to promote a key to the parent node.

void insert(Btree tree, int key, char *record) {
  int level = 0;
  PAGE parent = NULL;
  PAGE node;
  int i, parent_link;
  int disk_num, page_num;
  int num_index_levels;

  /* allocate page buffers for the parent and the current node */
  parent = make_page_buffer(M_I, INDEX_ENTRY_SIZE, '\0');
  node = make_page_buffer(M_I, INDEX_ENTRY_SIZE, '\0');

After allocating the buffers for the parent and node, we read the root page into memory. For simplicity, we will assume that the root page is on disk rather than in main memory. The root page is accessed via the root_disk and root_page_num fields of the B-tree data structure. Both of these fields are integers:

  read_disk(tree->root_disk, node, tree->root_page_num);

Handy Inspection and Debugging Tools

It can be awfully tricky to ensure that everything gets written correctly to a disk or to a page buffer, or read correctly from a disk into a page buffer. Hence, the disk package has two handy commands, one for inspecting the contents of a page on a disk and one for inspecting the contents of a page buffer.

inspect_disk_page(int disk, int page). Let's say you wanted to see the contents of the root page on the disk. You can either put the inspect_disk_page command in your program or, when in a debugger such as gdb, simply type at the gdb prompt:

(gdb) p inspect_disk_page(tree->root_disk, tree->root_page_num)

You will get a list of the page's contents. For example:

rec      record contents
-------------------------------------------------------
0       0 0 5
1       0 3 8
2       1 1 12
3       1 2 18
4       0 1 19
5       1 0 -1

inspect_page(PAGE page_buffer). Once the page has been read from the disk into a page buffer, you can inspect the contents of the page buffer at any time using the inspect_page command. It prints the contents of a page buffer using the same format shown above.

Searching the Index

Now that the root page is in memory we can proceed to our search of the index. Since we do not keep an explicit type indicator in our page nodes, we cannot easily determine whether a page we have read into memory is an index node or a record node. Hence, we keep track of how many levels are in the index and how many levels we have examined so far.

  /* save the number of index levels in the tree because it could be
     increased by a root split */
  num_index_levels = tree->num_index_levels;

  /* search the index and retrieve the node into which the record should
   be inserted */
  while (level < num_index_levels) {
    /* locate the link to the next node; get_key is evaluated first so that
       input_buffer always holds index record i when the loop stops */
    for (i = 0; (key >= get_key(node, i)) && (i < node->num_recs-1); i++)
      ;

get_key

The for loop runs through the keys in the index and stops when it finds a key strictly larger than the key we want to insert (it also stops if we run out of keys). Note that there is a call to a function called get_key. get_key reads the index record at location i into an IS struct called input_buffer and returns the integer key from the record. input_buffer is assumed to be a global variable:

int get_key(PAGE node, int index_num) {
  read_rec(input_buffer, node, index_num);
  return atoi(input_buffer->fields[2]);
}

The decision not to explicitly pass input_buffer to get_key is somewhat questionable. If we passed it explicitly, a reader of the code would be explicitly reminded that a side effect of get_key is to read an index record into input_buffer. However, input_buffer is used ubiquitously throughout the program, so it seemed better to keep the size of the parameter list down by assuming that input_buffer is the buffer that will receive the index record. There is always a tension between explicitly passing parameters to a function, and hence explicitly showing all the variables from the caller that the function might manipulate, and keeping the parameter list small by using some global variables. The choice of how to resolve this tension is one that comes with experience.
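To make the scan and its side effect concrete, here is a minimal in-memory sketch of the explicit-buffer alternative. It is not the course code: IndexRec, get_key_explicit, and find_link are hypothetical names, and the page is modeled as an array of text records laid out as "disk page key", which is an assumption based on get_key reading the key from fields[2].

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for one parsed index record: (disk, page, key). */
typedef struct {
    int disk;
    int page;
    int key;
} IndexRec;

/* Explicit-buffer variant of get_key: parses record index_num of the page
   into *out and returns its key. Modeling a page as an array of text
   records is an assumption made for this sketch. */
int get_key_explicit(const char *page[], int index_num, IndexRec *out) {
    int matched = sscanf(page[index_num], "%d %d %d",
                         &out->disk, &out->page, &out->key);
    assert(matched == 3);
    return out->key;
}

/* Same scan as the for loop in insert. The key test comes first on purpose:
   record i is always parsed before the loop stops, so *out ends up holding
   the record whose link we are about to follow. */
int find_link(const char *page[], int num_recs, int key, IndexRec *out) {
    int i;

    for (i = 0; (key >= get_key_explicit(page, i, out)) && (i < num_recs - 1); i++)
        ;
    return i;
}
```

If we run find_link on the six records of the sample page shown earlier with key 10, the scan stops at position 2 (the record with key 12), and the buffer holds disk 1, page 1. The caller can see from the parameter list that the buffer is modified, at the cost of one extra argument on every call.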

Split a Node?

Next we want to determine if the index node should be split. Before we do this, however, we save the location of the child node, which the last call to get_key left in input_buffer, since this location could be lost during the index node split.

    /* save the disk and page number of the child node */
    disk_num = atoi(input_buffer->fields[0]);
    page_num = atoi(input_buffer->fields[1]);

    /* split the index node if it is full */
    if ((node->num_recs - 1) == tree->max_keys)
      index_split(parent, parent_link, node, tree, level);

A Careful Swap

After checking for and possibly performing an index split, we are ready to move to the child. In an internal memory search, we would simply move the parent pointer to point at the current node and move the node pointer to point at the child node. However, this search is an external memory search and hence we don't have the child node already in memory. Thus we must find a page buffer into which the child node may be read. Since the current parent's page buffer is no longer needed, we can read the child node into the parent's page buffer. Before doing so, we make the parent pointer point to the current node's page buffer. To effect these changes, we swap the node and parent pointers:

    /* exchange the node and parent pointers. The parent pointer will now
       point to node. The current parent buffer is no longer needed
       so we can use it to hold node's child */ 
    swap_ptrs(&node, &parent);

Why do you think we passed the addresses of these pointers rather than the pointers themselves? The reason is that C uses call-by-value. If we simply passed the pointers and swapped them using the following code, the swap would be lost as soon as we returned from the swap function:

void swap_ptrs(PAGE node1, PAGE node2) {
   PAGE tmp;
 
   tmp = node1;
   node1 = node2;
   node2 = tmp;
}

To avoid losing the effect of the swap, we must pass the pointers' addresses. This leads to the following, correct, swap routine:

void swap_ptrs(PAGE *node1, PAGE *node2) {
  PAGE tmp;

  tmp = *node1;
  *node1 = *node2;
  *node2 = tmp;
}
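The difference between the two versions is easy to check in a small stand-alone sketch. The PAGE type below is a hypothetical stand-in for the disk package's type, and swap_by_value is simply the broken version under a different name so that both can live in one file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the disk package's PAGE type. */
typedef struct page_buffer {
    int page_num;
} *PAGE;

/* Broken: only the local copies of the two pointers are exchanged. */
void swap_by_value(PAGE node1, PAGE node2) {
    PAGE tmp = node1;
    node1 = node2;
    node2 = tmp;
    (void)node1;   /* silence "set but not used" warnings */
    (void)node2;
}

/* Correct: the caller's pointers are exchanged through their addresses. */
void swap_ptrs(PAGE *node1, PAGE *node2) {
    PAGE tmp;

    assert(node1 != NULL && node2 != NULL);
    tmp = *node1;
    *node1 = *node2;
    *node2 = tmp;
}
```

After a call to swap_by_value(node, parent), the caller's node and parent still point at their original buffers. After swap_ptrs(&node, &parent), they have traded buffers.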

On to the Child

Now that we've swapped the node and parent pointers, we are ready to read the child node. We also save the integer location of the link we followed in the parent so that we can easily split the child. If we split the child, the promoted key will be inserted right after this integer location.

    /* read node's child */
    read_disk(disk_num, node, page_num);

    /* saving the link we followed is helpful if we need to split the child */
    parent_link = i;
    level++;
  } /* end of the while */
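The whole descent, with its two buffers and the pointer swap, can be sketched against a simulated disk. Everything here is an assumption made for illustration: Node, read_page, and find_record_page are hypothetical names, the "disk" is just an array of nodes indexed by page number, and node splitting is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 4

/* Hypothetical in-memory index node: child[i] is the page to follow when
   the search key is smaller than key[i]; the last key is a placeholder. */
typedef struct {
    int num_recs;            /* number of links stored in this node */
    int child[MAX_LINKS];
    int key[MAX_LINKS];
} Node;

/* Stand-in for read_disk: copy one page of the simulated disk into buf. */
static void read_page(const Node *disk, int page_num, Node *buf) {
    *buf = disk[page_num];
}

/* Walk down num_index_levels index levels and return the page number of
   the record node reached. *parent_link receives the position of the last
   link followed. Only two buffers are ever used. */
int find_record_page(const Node *disk, int root_page, int num_index_levels,
                     int key, int *parent_link) {
    Node buf_a, buf_b;
    Node *node = &buf_a, *parent = &buf_b, *tmp;
    int level, i, page_num = root_page;

    assert(parent_link != NULL);
    read_page(disk, root_page, node);
    for (level = 0; level < num_index_levels; level++) {
        for (i = 0; (key >= node->key[i]) && (i < node->num_recs - 1); i++)
            ;
        page_num = node->child[i];

        /* the old parent buffer is free, so it receives the child */
        tmp = node;
        node = parent;
        parent = tmp;
        read_page(disk, page_num, node);
        *parent_link = i;
    }
    return page_num;
}
```

At every iteration parent points at the node we just left and node points at the buffer that was the grandparent a moment ago, which is why two buffers are enough no matter how deep the tree is.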

Inserting the Record

Once we exit the index loop, we know that we have the appropriate record node in memory. Consequently, we want to insert the record into this node. Before doing so, we must reformat the node's page buffer so that it thinks that it is a record node rather than an index node. When we were reading index nodes, we assumed that the node had M_I records (M_I - 1 keys plus an extra record for the extra link at the end) with a record length equal to the length of an index record. Now that we have a record node, we must tell the disk package that the node has M_B records (M_B - 1 Btree records plus 1 header record) and that the record length is equal to the length of the records in the database:

  /* insert the record into the node, but first, change the formatting
     information so that the node is treated as a record node rather
     than an index node */
  node->max_recs = M_B;
  node->rec_length = RECORD_SIZE;

Once the reformatting is accomplished, we insert the record into the node, splitting it if necessary:

  if ((node->num_recs - 1) == tree->max_recs) {
    record_node_split(parent, parent_link, node, tree, key, record, level);
  }
  else {
    insert_rec(node, key, record);
    write_disk(node);
  }

Note that we are assuming that record_node_split writes the node out to disk but that insert_rec does not. Hence there is a write_disk call after insert_rec but not after record_node_split.

Finally we clean up the insert routine by destroying the page buffers we've been using and exiting:

  /* free the node and parent page_buffers */
  destroy_page_buffer(node);
  destroy_page_buffer(parent);
}

Inserting a Record via Insertion Sort

If there is enough space in a record node to accommodate a record, we insert it using insertion sort. We start at the end of the record node's page buffer and move records one position over until we reach the location where the record should be inserted. Moving the records one position over opens a hole in the page buffer which can then accommodate the new record:

/* use insertion sort to insert the record */
void insert_rec(PAGE node, int key, char *new_record) {
  int i;
  int num_recs;

  num_recs = node->num_recs;
  /* shift records with larger keys one position toward the end; stop at
     position 1 because position 0 holds the header record */
  for (i = num_recs; (i > 1) && (key < get_rec_key(node, i-1)); i--)
      move_rec(node, i-1, i);
  /* position i is now free for the new record */
  write_rec(node, i, new_record);
}

Two things should be noted about this code:

We used move_rec to move records one position over.
We stop moving records when i reaches 1, because in this case all the records have been moved over one position. The header record is at position 0, so we want to insert the new record at position 1 if i reaches 1. It is very easy to make off-by-one errors in code like this. For example, it would be very easy to inadvertently start i at num_recs-1 rather than num_recs, or to continue while i is greater than 0 rather than 1. When you find that your page buffers don't seem to have quite the right contents, check your code to ensure you haven't made any off-by-one errors.
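The loop bounds are easier to experiment with on a plain array. The following sketch uses the same bounds as insert_rec but works on integer keys only. The name insert_key, the capacity parameter, and returning the new record count are assumptions of the sketch, not part of the disk package.

```c
#include <assert.h>

/* keys[0] models the header record and is never moved; keys[1..num_recs-1]
   are sorted. Inserts key in order and returns the new record count.
   The caller must guarantee there is room for one more entry. */
int insert_key(int keys[], int num_recs, int capacity, int key) {
    int i;

    assert(num_recs >= 1 && num_recs < capacity);
    /* start at the first free slot and shift larger keys right */
    for (i = num_recs; (i > 1) && (key < keys[i - 1]); i--)
        keys[i] = keys[i - 1];
    keys[i] = key;
    return num_recs + 1;
}
```

Starting i at num_recs-1 in this sketch would overwrite the last key instead of moving it, and testing i > 0 would let a small key slide into the header's slot. Those are the two mistakes described above.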

Splitting a Record Node--The Highlights

A fair amount of code is required to split a record node. If you're interested in all the details, see btree.c in the source directory mentioned at the beginning of these notes. There are two parts of the code of special interest however, since you will be implementing similar operations in your extendible hash tree lab.

The first operation involves allocating a page buffer for the new node and allocating disk space for it. The page buffer is created with the disk package's make_page_buffer command. We pass the number of records the new node can contain (M_B rather than M_B-1, because the node contains M_B-1 database records plus 1 header record). The allocation of disk space involves a clever use of the mod operator. We keep a running count of the total number of pages in use thus far in a variable called page_count. Each time we want a new page, we use the integer divide and mod operators to compute a free disk and page number, then increment page_count so that it refers to the next free page:

void assign_disk_space(PAGE page_buffer) {
  page_buffer->disk = page_count / PAGE_LIMIT;
  page_buffer->page_num = page_count % PAGE_LIMIT;
  page_count++;
}

The code for creating a page buffer for the new node and assigning it disk space can now be written as:

  /* create the new node */
  new_node = make_page_buffer(M_B, RECORD_SIZE, '\0');
  assign_disk_space(new_node);
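To see how the divide and mod operators carve up the page count, here is a small stand-alone sketch. The value 4 for PAGE_LIMIT, the Location struct, and passing the counter explicitly are all assumptions made so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_LIMIT 4   /* hypothetical number of pages per disk */

typedef struct {
    int disk;
    int page_num;
} Location;

/* Return the next free (disk, page) pair and advance the running count. */
Location next_free_location(int *page_count) {
    Location loc;

    assert(page_count != NULL && *page_count >= 0);
    loc.disk = *page_count / PAGE_LIMIT;      /* which disk */
    loc.page_num = *page_count % PAGE_LIMIT;  /* which page on that disk */
    (*page_count)++;
    return loc;
}
```

With PAGE_LIMIT equal to 4, successive calls starting from a count of 0 hand out (0,0), (0,1), (0,2), (0,3), (1,0), (1,1), and so on. Each disk fills completely before the next one is used.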

The second operation of interest is transferring records to the new node. We use two counters to perform this move--one keeps track of our location in the old node and one keeps track of our location in the new node. For no particular reason we move backwards in each node (i.e., start at the end of each node and move toward the beginning):

  /* transfer records to the new node */
  for (i = num_recs_to_move, j = node->num_recs-1; i >= 1; i--, j--) {
    read_rec(input_buffer, node, j);
    delete_rec(node, j);
    write_rec(new_node, i, input_buffer->text1);
  }

Two points should be made about this code:

The delete_rec command is critical. Without it, the record does not get deleted from the old node and your old node will still think it's full (and it will still think it has the transferred records).
The counter for the old node starts at num_recs-1 rather than num_recs because locations in a page are numbered from 0 to num_recs-1. Hence the last record in the old page is at location num_recs-1.
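The two-counter transfer can also be tried on plain arrays. In this sketch position 0 of each array models the header record, decrementing old_n stands in for delete_rec, and the name transfer_keys is hypothetical.

```c
#include <assert.h>

/* Move the last num_to_move keys of old_keys into positions
   1..num_to_move of new_keys, working backwards in both arrays.
   Returns the number of records left in the old node. */
int transfer_keys(int old_keys[], int old_n, int new_keys[], int num_to_move) {
    int i, j;

    assert(num_to_move >= 0 && num_to_move <= old_n - 1);
    for (i = num_to_move, j = old_n - 1; i >= 1; i--, j--) {
        new_keys[i] = old_keys[j];
        old_n--;   /* models delete_rec: the old node shrinks by one */
    }
    return old_n;
}
```

If the old node holds a header plus the keys 10, 20, 30, 40 and we move two records, the new node receives 30 and 40 in positions 1 and 2, and the old node's record count drops from 5 to 3. Leaving out the decrement would leave the old node still reporting 5 records, which is the mistake the first point above warns against.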