CS302 Lecture Notes - Disjoint Sets


Reference Material Online


What do you need to know?


Implementation in C++

The API for disjoint sets is pretty minimal. It is in DJ.h:

#include <vector>
#include <iostream>
using namespace std;

class Disjoint {
  public:
    Disjoint(int nelements);
    int Union(int s1, int s2);
    int Find(int element);
    void Print();
  protected:
    vector <int> links;
    vector <int> ranks;
};

The links vector holds the parent pointer for each element. If links[e] is -1, then e is the root of its set, and e is also the set id. If links[e] is not -1, then the set id of e is equal to the set id of links[e].

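The ranks vector holds additional information about the root of each set, and what it holds depends on the implementation: with union-by-size it is the number of elements in the set, with union-by-height it is the height of the set's tree (the number of nodes on the longest path), and with union-by-rank it is what the height would be if there were no path compression.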

In all cases, if e is not the root of a set, ranks[e] is immaterial.

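I have three implementations: union-by-size (DJ-size.cpp), union-by-height (DJ-height.cpp) and union-by-rank with path compression (DJ-rank.cpp).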

I will start with union-by-size, and then show the differences with the others. The differences are very minor.

The constructor sets up the two vectors. Each element is in its own set, so all links are -1 and all ranks are 1.

Disjoint::Disjoint(int nelements)
{
  links.resize(nelements, -1);
  ranks.resize(nelements, 1);
}

The Find(e) operation chases links[e] until it equals -1:

int Disjoint::Find(int element)
{
  while (links[element] != -1) element = links[element];
  return element;
}

And the Union(s1, s2) operation first checks that s1 and s2 are valid set ids (that is, roots), and then chooses a parent and a child from s1 and s2. The parent is the root of the larger set; on a tie, s2 is the parent. It changes the link field of the child to point to the parent, and then it adds the child's size to the parent's size in the ranks vector:

int Disjoint::Union(int s1, int s2)
{
  int p, c;

  if (links[s1] != -1 || links[s2] != -1) {
    cerr << "Must call union on a set, and not just an element.\n";
    exit(1);
  }
  if (ranks[s1] > ranks[s2]) {
    p = s1;
    c = s2;
  } else {
    p = s2;
    c = s1;
  }
  links[c] = p;
  ranks[p] += ranks[c];    /* HERE */
  return p;
}
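
Since Union() insists on set ids, code that starts from arbitrary elements usually calls Find() on both of them first. Here is a minimal sketch of that pattern. The names MiniDisjoint and Merge are hypothetical and are not part of DJ.h; the struct is a pared-down copy of union-by-size so that the block stands on its own.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the union-by-size class (not the real DJ.h).
struct MiniDisjoint {
  std::vector<int> links, ranks;

  explicit MiniDisjoint(int n) : links(n, -1), ranks(n, 1) {}

  // Chase parent links until we reach a root.
  int Find(int e) const {
    while (links[e] != -1) e = links[e];
    return e;
  }

  // Same logic as Union() above; both arguments must be roots.
  int Union(int s1, int s2) {
    assert(links[s1] == -1 && links[s2] == -1);
    int p = s2, c = s1;                          // on a tie, s2 is the parent
    if (ranks[s1] > ranks[s2]) std::swap(p, c);  // otherwise the larger set wins
    links[c] = p;
    ranks[p] += ranks[c];
    return p;
  }

  // Hypothetical helper: merge the sets that contain elements a and b.
  // Returns the root of the merged set, and does nothing if a and b
  // are already in the same set.
  int Merge(int a, int b) {
    int s1 = Find(a), s2 = Find(b);
    if (s1 == s2) return s1;
    return Union(s1, s2);
  }
};
```

The s1 == s2 test matters: calling Union(s, s) on the code above would set links[s] to s, and then Find() would loop forever.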

I won't show Print(): it simply prints out the vectors.

The only difference between union-by-size and union-by-height is that ranks keeps track of the height of each set: the number of nodes on the longest path from the root to a leaf. It is a one-line change to union-by-size. In DJ-height.cpp, the line marked HERE is changed to:

  if (ranks[s1] == ranks[s2]) ranks[p]++;

This is because a set's height only changes if the two sets being merged have equal heights.
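
To see the one-line change in context, here is a sketch of the whole union-by-height operation written as a free function. The name UnionByHeight and the free-function form are my own for illustration; DJ-height.cpp itself keeps the Union() method of the class.

```cpp
#include <cassert>
#include <vector>

// Hypothetical free-function version of union-by-height.
// links[e] == -1 marks a root; ranks[root] is the number of nodes
// on the longest path from that root down to a leaf.
int UnionByHeight(std::vector<int> &links, std::vector<int> &ranks, int s1, int s2)
{
  assert(s1 != s2 && links[s1] == -1 && links[s2] == -1);

  int p, c;
  if (ranks[s1] > ranks[s2]) { p = s1; c = s2; } else { p = s2; c = s1; }
  links[c] = p;

  // The taller tree absorbs the shorter one without growing.
  // Only a tie makes the merged tree one node taller.
  if (ranks[s1] == ranks[s2]) ranks[p]++;
  return p;
}
```

Merging two single-node sets gives a tree of height 2; merging a third single-node set into it leaves the height at 2.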

Finally, union-by-rank is equivalent to union-by-height, except that you perform path compression on find operations. With path compression, each time you perform a Find(e) operation, you update the links field of all elements on the path to the root, so that they equal the root. I do this with a vector that holds all the non-root elements in the path:

int Disjoint::Find(int element)
{
  vector <int> q;
  size_t i;    // unsigned, to match the type of q.size()

  while (links[element] != -1) {
    q.push_back(element);
    element = links[element];
  }
  for (i = 0; i < q.size(); i++) links[q[i]] = element;
  return element;
}

This is one of those convenient things about the STL -- I don't have to call new or delete. When the Find() operation is over, the vector is deallocated.

I could implement path compression in two other ways. The first is with simple recursion:

int Disjoint::Find(int element)
{
  if (links[element] == -1) return element;
  links[element] = Find(links[element]);
  return links[element];
}

The second is to traverse links to the root and, while doing so, temporarily set links[element] to be element's child (the node we just came from). In that way, once you find the root, you can use links to go back to the original element, performing path compression along the way. The code is here. If you're a little leery of this code, copy it to your directory and put in some print statements. This should be the best implementation performance-wise, because unlike the other two, it uses no extra memory: there is no vector and no recursion stack.

int Disjoint::Find(int e)
{
  int p, c;   // p is the parent, c is the child.
  c = -1;
  while (links[e] != -1) {
    p = links[e];
    links[e] = c;
    c = e;
    e = p;
  }

  p = e;
  e = c;
  while (e != -1) {
    c = links[e];
    links[e] = p;
    e = c;
  }
  return p;
}
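
If you'd rather check than trust, here is a sketch that puts the three versions side by side as free functions operating on a links vector, so that they can be run on the same input and compared. The function names are hypothetical; the bodies follow the three Find() implementations above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Version 1: remember the path in a vector, then repoint it at the root.
int FindWithVector(std::vector<int> &links, int element)
{
  std::vector<int> q;
  while (links[element] != -1) {
    q.push_back(element);
    element = links[element];
  }
  for (std::size_t i = 0; i < q.size(); i++) links[q[i]] = element;
  return element;
}

// Version 2: recursion; each call repoints its element on the way back.
int FindRecursive(std::vector<int> &links, int element)
{
  if (links[element] == -1) return element;
  links[element] = FindRecursive(links, links[element]);
  return links[element];
}

// Version 3: reverse the links on the way up, then walk back down
// the reversed links, pointing every node at the root.
int FindWithReversal(std::vector<int> &links, int e)
{
  int p, c = -1;
  while (links[e] != -1) {
    p = links[e];
    links[e] = c;     // temporarily point at the child
    c = e;
    e = p;
  }
  p = e;              // p is now the root
  e = c;
  while (e != -1) {
    c = links[e];     // next node down the original path
    links[e] = p;
    e = c;
  }
  return p;
}
```

On the chain 0 -> 1 -> 2 -> 3, with 3 as the root, calling any of the three on element 0 returns 3 and leaves 0, 1 and 2 all pointing directly at 3.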


Example of Use

The program dj-ex1.cpp shows a simple example of using the API. I won't put the code inline here; see the file itself. The makefile compiles it into three executables, one for each implementation:
UNIX> make
g++ -c -O dj-ex1.cpp
g++ -c -O DJ-size.cpp
g++ -O -o dj-ex1-size dj-ex1.o DJ-size.o
g++ -c -O DJ-height.cpp
g++ -O -o dj-ex1-height dj-ex1.o DJ-height.o
g++ -c -O DJ-rank.cpp
g++ -O -o dj-ex1-rank dj-ex1.o DJ-rank.o
UNIX> 
We first run it with union-by-size. Let's look at the output incrementally. When the program starts, it creates a Disjoint instance with ten elements, each in its own set:

UNIX> dj-ex1-size
Starting State:

Elts:   0  1  2  3  4  5  6  7  8  9
Links: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Ranks:  1  1  1  1  1  1  1  1  1  1

Next, it performs three union operations: Union(0, 1), Union(2, 3), and Union(4, 5). Since the two sets in each of the three operations are the same size, the choice of parent and child is arbitrary (this code makes the second argument the parent). Here's the output and how it looks pictorially (I've added the sizes to the roots of each set):

Doing d.Union(0, 1).  Resulting set = 1
Doing d.Union(2, 3).  Resulting set = 3
Doing d.Union(4, 5).  Resulting set = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1 -1  3 -1  5 -1 -1 -1 -1 -1
Ranks:  1  2  1  2  1  2  1  1  1  1

Next, it performs four more union operations: Union(1, 3), Union(5, 6), Union(5, 7), and Union(5, 8). The first union operation merges two sets of the same size, so the parent/child selection is arbitrary. The remaining three union operations merge sets of size 1 (sets 6, 7 and 8) with set 5, which is larger. Thus, in each case, set 5 becomes the parent. The resulting sets are pictured to the right.

The Find() operations return the root of each set -- three in the set {0, 1, 2, 3}, and five in the set {4, 5, 6, 7, 8}.

You should make sure that you understand how the output of the program maps to the picture. In particular, make sure you understand the Links and Ranks lines and what they mean.

Doing d.Union(1, 3).  Resulting set = 3
Doing d.Union(5, 6).  Resulting set = 5
Doing d.Union(5, 7).  Resulting set = 5
Doing d.Union(5, 8).  Resulting set = 5
d.Find(1) = 3
d.Find(2) = 3
d.Find(4) = 5
d.Find(7) = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5 -1  5  5  5 -1
Ranks:  1  2  1  4  1  5  1  1  1  1

Now, we perform Union(3, 5). Since set 5 has more elements than set 3, it is the parent and 3 is the child. Subsequent Find() operations on 3, 5, 7 and 0 all return 5 as the set id:

Doing d.Union(3, 5).  Resulting set = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3  5  5 -1  5  5  5 -1
Ranks:  1  2  1  4  1  9  1  1  1  1

d.Find(3) = 5
d.Find(5) = 5
d.Find(7) = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3  5  5 -1  5  5  5 -1
Ranks:  1  2  1  4  1  9  1  1  1  1

d.Find(0) = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3  5  5 -1  5  5  5 -1
Ranks:  1  2  1  4  1  9  1  1  1  1

UNIX> 
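
Since dj-ex1.cpp isn't shown inline, here is a sketch that replays the same eight Union() calls on a small stand-in class, so that the final Links and Ranks lines can be checked by hand or by a test. SizeSets and ReplaySizeExample are hypothetical names; the logic is the union-by-size code from earlier in these notes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the union-by-size Disjoint class.
struct SizeSets {
  std::vector<int> links, ranks;

  explicit SizeSets(int n) : links(n, -1), ranks(n, 1) {}

  int Find(int e) const {
    while (links[e] != -1) e = links[e];
    return e;
  }

  int Union(int s1, int s2) {
    assert(s1 != s2 && links[s1] == -1 && links[s2] == -1);
    int p = s2, c = s1;                          // ties make s2 the parent
    if (ranks[s1] > ranks[s2]) std::swap(p, c);
    links[c] = p;
    ranks[p] += ranks[c];                        // ranks holds set sizes
    return p;
  }
};

// Replays the Union() calls from the transcript, in the same order.
SizeSets ReplaySizeExample()
{
  SizeSets d(10);
  d.Union(0, 1); d.Union(2, 3); d.Union(4, 5);
  d.Union(1, 3); d.Union(5, 6); d.Union(5, 7); d.Union(5, 8);
  d.Union(3, 5);
  return d;
}
```

The links and ranks vectors of the returned object should match the last Links and Ranks lines of the transcript, and Find(0) should walk 0 -> 1 -> 3 -> 5.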


Now, when we run this on union-by-height, the output is the same until the third picture:

UNIX> dj-ex1-height 
Starting State:

Elts:   0  1  2  3  4  5  6  7  8  9
Links: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Ranks:  1  1  1  1  1  1  1  1  1  1
Doing d.Union(0, 1).  Resulting set = 1
Doing d.Union(2, 3).  Resulting set = 3
Doing d.Union(4, 5).  Resulting set = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1 -1  3 -1  5 -1 -1 -1 -1 -1
Ranks:  1  2  1  2  1  2  1  1  1  1
Doing d.Union(1, 3).  Resulting set = 3
Doing d.Union(5, 6).  Resulting set = 5
Doing d.Union(5, 7).  Resulting set = 5
Doing d.Union(5, 8).  Resulting set = 5
d.Find(1) = 3
d.Find(2) = 3
d.Find(4) = 5
d.Find(7) = 5

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5 -1  5  5  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

Although the trees look the same, the ranks fields are different, now holding heights rather than sizes. So, when we perform the last union of 3 and 5, 3 becomes the parent, since it has greater height. Subsequent Find() operations all return 3 now:

Doing d.Union(3, 5).  Resulting set = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5  3  5  5  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

d.Find(3) = 3
d.Find(5) = 3
d.Find(7) = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5  3  5  5  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

d.Find(0) = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5  3  5  5  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

UNIX> 


Lastly, with path compression (union-by-rank), all the output is the same until those last Find() operations. We'll start with the last Union, where the state of the data structure is the same as above:

UNIX> dj-ex1-rank 
....
....
Doing d.Union(3, 5).  Resulting set = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5  3  5  5  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

When we perform the three Find() operations, the last one, Find(7), performs path compression, setting node 7's link to 3, the root of the set:

d.Find(3) = 3
d.Find(5) = 3
d.Find(7) = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  1  3  3 -1  5  3  5  3  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

Similarly, the last Find(0) operation also performs path compression:

d.Find(0) = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  3  3  3 -1  5  3  5  3  5 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

Were we to call Find(4), Find(6) and Find(8), then path compression would make those nodes point directly to node 3 as well. In that case, the state would be the following:

d.Find(4) = 3
d.Find(6) = 3
d.Find(8) = 3

Elts:   0  1  2  3  4  5  6  7  8  9
Links:  3  3  3 -1  3  3  3  3  3 -1
Ranks:  1  2  1  3  1  2  1  1  1  1

I draw this picture because you should see that ranks[3] remains at three, even though the tree's height is now two. This is because the ranks field tracks what the height of the tree would be with no path compression. We can't keep it updated properly without adding to the running time of the Union() or Find() operations. Fortunately, it doesn't matter: the fine theoreticians of the world have proved that with union-by-rank and path compression, Find() operations run in O(α(n)) amortized time, where α is the inverse Ackermann function, which grows so slowly that it is effectively a constant. Union() operations are still O(1).
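
The gap between ranks[3] and the tree's real height can be checked directly. The sketch below replays the example on a stand-in class that uses the height rule plus path compression, and adds a read-only helper that measures the height from the links as they currently stand. RankSets, ReplayRankExample and ActualHeight are hypothetical names, not part of DJ-rank.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for union-by-rank: height rule plus path compression.
struct RankSets {
  std::vector<int> links, ranks;

  explicit RankSets(int n) : links(n, -1), ranks(n, 1) {}

  int Find(int e) {
    std::vector<int> path;
    while (links[e] != -1) { path.push_back(e); e = links[e]; }
    for (std::size_t i = 0; i < path.size(); i++) links[path[i]] = e;
    return e;
  }

  int Union(int s1, int s2) {
    assert(s1 != s2 && links[s1] == -1 && links[s2] == -1);
    int p = (ranks[s1] > ranks[s2]) ? s1 : s2;
    int c = (p == s1) ? s2 : s1;
    links[c] = p;
    if (ranks[s1] == ranks[s2]) ranks[p]++;
    return p;
  }

  // Number of nodes on the longest path that ends at root, measured from
  // the links as they are right now. It never writes to links, so it does
  // not compress anything.
  int ActualHeight(int root) const {
    int best = 0;
    for (std::size_t e = 0; e < links.size(); e++) {
      int nodes = 1, cur = static_cast<int>(e);
      while (links[cur] != -1) { cur = links[cur]; nodes++; }
      if (cur == root && nodes > best) best = nodes;
    }
    return best;
  }
};

// Replays the eight Union() calls of the example.
RankSets ReplayRankExample()
{
  RankSets d(10);
  d.Union(0, 1); d.Union(2, 3); d.Union(4, 5);
  d.Union(1, 3); d.Union(5, 6); d.Union(5, 7); d.Union(5, 8);
  d.Union(3, 5);
  return d;
}
```

Right after the unions, ActualHeight(3) and ranks[3] are both three. After Find() has been called on 7, 0, 4, 6 and 8, ActualHeight(3) drops to two while ranks[3] stays at three.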


Maze Generation Program with Disjoint Sets

Generating mazes is a neat application of Disjoint Sets. A maze, like the one pictured below, may be viewed as an undirected graph, where each cell in a grid is a node in the graph, and if one can move horizontally or vertically from one cell to the next, there is an edge between the nodes. If there is a wall between two cells, then there is no edge.

A good maze is one where the graph is connected, so that every cell is reachable from the start/end cells, but there are no cycles. We can generate such a maze using disjoint sets. We start with a completely disconnected graph, where each cell is surrounded by walls. If this graph has r rows and c columns, then the graph contains r*c nodes and no edges.

What we'll do is choose a random wall to remove. If that wall separates nodes in different connected components, then we'll remove it, thereby lowering the number of connected components. If it doesn't separate nodes in different connected components, we keep it.

This can be done with disjoint sets. We start with each cell in its own set, and then we choose a random wall. If that wall separates two cells in different sets, we remove the wall and call Union() on the two sets. Otherwise, we keep the wall. We keep doing this until we have just one set.
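
It is worth counting walls to see what the finished maze must look like. The count below considers interior walls only (the ones between two cells, which are the only ones that can be removed), and the W symbols are my own notation. Every successful Union() lowers the number of sets by exactly one, and we go from r*c sets down to one:

```latex
\begin{align*}
W_{\text{total}}   &= (r-1)\,c + r\,(c-1) \\
W_{\text{removed}} &= rc - 1 \\
W_{\text{left}}    &= W_{\text{total}} - W_{\text{removed}}
                    = rc - r - c + 1
                    = (r-1)(c-1)
\end{align*}
```

For example, a maze with 50 rows and 100 columns starts with 49*100 + 50*99 = 9850 interior walls, removes 4999 of them, and ends with 49*99 = 4851.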

The code is in maze-gen.cpp. It's a little tricky. We first generate all the walls. Walls that separate vertically adjacent cells are indexed by the smaller cell number. Walls that separate horizontally adjacent cells are indexed by the smaller cell number plus r*c. We generate all the walls and insert them into a multimap keyed by a random number. Then we traverse the multimap, deleting walls if they separate different components, until we have just one component. Then we print out the walls:

#include <vector>
#include <cstdlib>
#include <cstdio>
#include <map>
#include "DJ.h"
#include <iostream>
using namespace std;

typedef multimap <double, int> DIMap;
typedef DIMap::iterator DIMit;

int main(int argc, char **argv)
{
  int r, c, row, column, c1, c2, ncomp, s1, s2;
  Disjoint *d;
  DIMap walls;
  DIMit wit;
  DIMit tmp;

  if (argc != 3) { fprintf(stderr, "Bad dog\n"); exit(1); }

  r = atoi(argv[1]);
  c = atoi(argv[2]);

  d = new Disjoint(r*c);

  for (row = 0; row < r-1; row++) {      // Generate walls that separate vertical cells.
    for (column = 0; column < c; column++) {
      c1 = row*c + column;
      walls.insert(make_pair(drand48(), c1));
    }
  }

  for (row = 0; row < r; row++) {      // Generate walls that separate horizontal cells.
    for (column = 0; column < c-1; column++) {
      c1 = (row*c + column) + r*c;
      walls.insert(make_pair(drand48(), c1));
    }
  }

  ncomp = r*c;
  wit = walls.begin();
  while (ncomp > 1) {
    c1 = wit->second;
    if (c1 < r*c) {    // This is a wall separating vertical cells
      c2 = c1 + c;
    } else {              // This is a wall separating horizontal cells
      c1 -= r*c;
      c2 = c1+1;
    }
    s1 = d->Find(c1);
    s2 = d->Find(c2);
    if (s1 != s2) {       // Test for different connected components.
      d->Union(s1, s2);
      tmp = wit;
      wit++;
      walls.erase(tmp);
      ncomp--;
    } else {
      wit++;
    }
  }

  printf("ROWS %d COLS %d\n", r, c);
  for (wit = walls.begin(); wit != walls.end(); wit++) {
    c1 = wit->second;
    if (c1 < r*c) {
      c2 = c1 + c;
    } else {
      c1 -= r*c;
      c2 = c1+1;
    }
    printf("WALL %d %d\n", c1, c2);
  }
}
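
The wall encoding is the tricky part, so here is a sketch that pulls the decoding into a helper, and then runs the same algorithm with the walls in a vector. Note what is swapped in: this version uses std::shuffle with std::mt19937 in place of the multimap keyed by drand48(), and a small inlined union-by-size structure in place of DJ.h, so that the block stands alone. The names WallCells and RemainingWalls are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: decode a wall index into the two cells it separates,
// using the same encoding as maze-gen.cpp.
std::pair<int, int> WallCells(int wall, int r, int c)
{
  assert(wall >= 0 && wall < 2 * r * c);
  if (wall < r * c) return std::make_pair(wall, wall + c);  // cell and the cell below it
  int c1 = wall - r * c;
  return std::make_pair(c1, c1 + 1);                        // cell and the cell to its right
}

// Sketch of the maze algorithm: returns the indices of the walls that remain.
std::vector<int> RemainingWalls(int r, int c, unsigned seed)
{
  std::vector<int> walls;
  for (int row = 0; row < r - 1; row++)       // walls between vertical neighbors
    for (int col = 0; col < c; col++) walls.push_back(row * c + col);
  for (int row = 0; row < r; row++)           // walls between horizontal neighbors
    for (int col = 0; col < c - 1; col++) walls.push_back(row * c + col + r * c);

  std::mt19937 gen(seed);
  std::shuffle(walls.begin(), walls.end(), gen);   // random order of consideration

  // Inlined union-by-size disjoint sets: links[e] == -1 marks a root.
  std::vector<int> links(r * c, -1), sizes(r * c, 1);
  auto find = [&links](int e) {
    while (links[e] != -1) e = links[e];
    return e;
  };

  std::vector<int> kept;
  for (int w : walls) {
    std::pair<int, int> cells = WallCells(w, r, c);
    int s1 = find(cells.first), s2 = find(cells.second);
    if (s1 != s2) {                                 // different components: remove the wall
      if (sizes[s1] > sizes[s2]) std::swap(s1, s2); // s2 is now the larger root
      links[s1] = s2;
      sizes[s2] += sizes[s1];
    } else {
      kept.push_back(w);                            // same component: keep the wall
    }
  }
  return kept;
}
```

Unlike the loop in maze-gen.cpp, this one does not stop early when one component remains. It doesn't need to: once everything is connected, every remaining wall separates two cells in the same set and is kept.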

We can run this and pipe the output to the program maze_ppm (from a CS302 lab that you may not have done yet), and that lets us generate mazes of all sizes:

UNIX> maze-gen-size 50 100 | maze_ppm 5 | convert - maze2.jpg