The Disjoint Set data structure solves a specific problem that is interesting both theoretically and practically. The problem is as follows:
You have a collection of n items, which you number from 0 to n-1. These items will be partitioned into some number of sets. The sets are "disjoint," which means that no item belongs to more than one set. Every item belongs to some set, though (hence the use of the word "partition").
There are two operations that you can perform: Find(i), which returns the set id of the set containing item i, and Union(i, j), which merges the two sets whose set ids are i and j.
Disjoint sets are very useful in connected component applications. They are also extremely efficient (we'll talk about that later).
When a node has a NULL link, we call it the "root" of a set. If you call Find() on a node with a NULL link, it will return the node's item number, and that is the set id of the node. Therefore, when you first start, every node is the root of its own set, and when you call Find(i), it will return i.
When you call Union(i, j), remember that i and j must be set id's. Therefore, they must be nodes with NULL links. What you do is have one of those nodes set its link to the other node.
Let's illustrate with a simple example. We initialize an instance of disjoint sets with 10 items. Each item is a node with a number from 0 to 9. Each node has a NULL link, which we depict by not drawing any arrows from the node:
Again, each node is in its own set, and each node's set id is its number. Suppose we call Union(0, 1), Union(2, 3) and Union(4, 5). Each call sets the link of one of its two nodes to the other node. We'll talk about how that gets done later. However, suppose this is the result:
As you can see, node 0's link has been set to 1. Both of these nodes' set ids are now 1, which means Find(0) equals Find(1) equals one. Similarly, Find(2) equals Find(3) equals three.
This gives you a clue about implementing Find(). When you call Find(n), what you do is keep setting n to n's link, until n's link is NULL. At that point, you are at the root of the set, and you return n.
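Union is pretty simple, too, but you have some choices about how to determine which node sets its link field to the other. We use three methods to do this: union by size (the root of the smaller set links to the root of the larger one), union by height (the root of the shorter set links to the root of the taller one), and union by rank with path compression (like union by height, except that each Find() also points every node on its path directly at the root).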
As always, a picture helps. Suppose this is the state of our disjoint set instance:
There are two sets, with set id's 5 and 9. Now, suppose you call Find(0). It will return five, but along the way to the root node of its set, it encounters nodes 1 and 3. Before returning five, it sets the links of 0, 1 and 3 to five:
Do you see why this is a good thing? Previously, when you called Find(0), you needed to travel through nodes 1 and 3 before getting to 5. If you call Find(0) again, you get to node 5 directly. Similarly, you have improved the performance of Find(1), and Find(2).
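The effect described above can be sketched with a plain vector of links. This is only an illustration under stated assumptions, not the class used in these notes: the names find_compress and example_chain are made up for this sketch, -1 stands in for a NULL link, and only the chain 0 -> 1 -> 3 -> 5 is taken from the example (every other node is assumed to be a root of its own set).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns the root of e's set, and points every
// node on the path from e to the root directly at that root.
int find_compress(std::vector<int> &links, int e)
{
  assert(e >= 0 && e < (int) links.size());

  int root = e;
  while (links[root] != -1) root = links[root];   // First pass: locate the root.

  while (links[e] != -1) {                        // Second pass: relink the path.
    int next = links[e];
    links[e] = root;
    e = next;
  }
  return root;
}

// Builds the chain 0 -> 1 -> 3 -> 5 from the example.
// Assumption: all other nodes are roots (link of -1).
std::vector<int> example_chain()
{
  std::vector<int> links(10, -1);
  links[0] = 1;
  links[1] = 3;
  links[3] = 5;
  return links;
}
```

Calling find_compress(links, 0) on the vector from example_chain() returns 5 and leaves nodes 0, 1 and 3 all linked straight to 5, so a second call on any of them reaches the root in one step.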
You can see that path compression has altered the height of the set. However, we maintain what its height would be, had we not used path compression, and call it the set's rank. We use the rank to determine how we perform union.
    #include <vector>
    #include <iostream>
    using namespace std;

    class Disjoint {
      public:
        Disjoint(int nelements);
        int Union(int s1, int s2);
        int Find(int element);
        void Print();
      protected:
        vector <int> links;
        vector <int> ranks;
    };

The links data structure holds the parent pointers for each element. If links[e] is equal to -1, then e is the root and set id of its set. If links[e] does not equal -1, then the set id of e is equal to the set id of links[e].
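The ranks vector holds additional information about the set rooted at e: its size under union by size, its height under union by height, and its rank under union by rank.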
In all cases, if e is not the root of a set, ranks[e] is immaterial.
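I have three implementations: DJsize.cpp (union by size), DJheight.cpp (union by height), and DJrank.cpp (union by rank with path compression). We start with union by size.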
The constructor sets up the two vectors. Each element is in its own set, so all links are -1 and all ranks are 1.

    Disjoint::Disjoint(int nelements)
    {
      links.resize(nelements, -1);   /* -1 marks a root: no parent. */
      ranks.resize(nelements, 1);    /* Every set starts with one element. */
    }
The Find(e) operator chases links[e] until it equals -1:

    int Disjoint::Find(int element)
    {
      while (links[element] != -1) element = links[element];
      return element;
    }
And the Union(s1, s2) operator first checks to make sure that the set id's are valid, and then chooses a parent and a child from s1 and s2. The parent will be the root of the bigger of the two sets. It changes the link field of the child to point to the parent, and then it updates the size of the parent in the ranks vector:

    int Disjoint::Union(int s1, int s2)
    {
      int p, c;

      /* Both arguments must be roots (set ids). */
      if (links[s1] != -1 || links[s2] != -1) {
        cerr << "Must call union on a set, and not just an element.\n";
        exit(1);
      }

      /* The root of the larger set becomes the parent. */
      if (ranks[s1] > ranks[s2]) {
        p = s1;
        c = s2;
      } else {
        p = s2;
        c = s1;
      }
      links[c] = p;
      ranks[p] += ranks[c];     /* HERE */
      return p;
    }
I won't show Print(): it simply prints out the vectors.
The only difference between union-by-size and union-by-height is that ranks keeps track of the height of each set, measured as the number of nodes on the longest path. It is a one-line change to union-by-size: in DJheight.cpp, the line marked HERE is changed to:

    if (ranks[s1] == ranks[s2]) ranks[p]++;
This is because a set's height only changes if the two sets being merged have equal heights.
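To see the height rule in isolation, here is a minimal stand-alone sketch of union by height. It is not the Disjoint class from these notes: the struct name HeightSets is hypothetical, the members are public only so they are easy to inspect, and assert() replaces the error message and exit(). The equality test is written against ranks[c] after the link is made, which is the same condition as comparing ranks[s1] and ranks[s2].

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-alone version of union by height.
struct HeightSets {
  std::vector<int> links;   // links[e] == -1 marks a root.
  std::vector<int> ranks;   // For a root: height of its tree, counted in nodes.

  explicit HeightSets(int n) : links(n, -1), ranks(n, 1) {}

  int Find(int e) const {
    while (links[e] != -1) e = links[e];
    return e;
  }

  int Union(int s1, int s2) {
    assert(links[s1] == -1 && links[s2] == -1);   // Both must be set ids.
    assert(s1 != s2);                             // And they must be different sets.

    int p, c;
    if (ranks[s1] > ranks[s2]) { p = s1; c = s2; } else { p = s2; c = s1; }
    links[c] = p;

    // The parent only gets taller when both trees had the same height.
    if (ranks[p] == ranks[c]) ranks[p]++;
    return p;
  }
};
```

With ten elements, calling Union(0, 1), Union(2, 3) and then Union(1, 3) leaves ranks[3] at 3, while repeatedly attaching single nodes to set 5 leaves ranks[5] at 2, because attaching a shorter tree never changes the taller tree's height.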
Finally, union-by-rank is equivalent to union-by-height, except that you perform path compression on find operations. With path compression, each time you perform a Find(e) operation, you update the links field of all elements on the path to the root, so that they equal the root. I do this with a vector that holds all the non-root elements in the path:

    int Disjoint::Find(int element)
    {
      vector <int> q;
      size_t i;

      /* Walk to the root, recording every non-root element on the path. */
      while (links[element] != -1) {
        q.push_back(element);
        element = links[element];
      }

      /* Path compression: point each recorded element directly at the root. */
      for (i = 0; i < q.size(); i++) links[q[i]] = element;

      return element;
    }

This is one of those convenient things about the STL: I don't have to call new or delete. When the Find() operation is over, the vector is deallocated.
I could implement path compression in two other ways. The first is with simple recursion:
    int Disjoint::Find(int element)
    {
      if (links[element] == -1) return element;
      links[element] = Find(links[element]);   /* Relink to the root on the way back. */
      return links[element];
    }
The second is to traverse links to the root, but while doing so, to set links[element] to be element's child. In that way, once you find the root, you can use links to go back to the original element, performing path compression along the way. The code is here; if you're a little leery of this code, copy it to your directory and put in some print statements. This should be the best implementation performance-wise, because it doesn't use extra memory like the other two.

    int Disjoint::Find(int e)
    {
      int p, c;   // P is the parent, c is the child.

      /* Walk up to the root, reversing each link so that it points to the child. */
      c = -1;
      while (links[e] != -1) {
        p = links[e];
        links[e] = c;
        c = e;
        e = p;
      }

      /* Walk back down the reversed links, pointing each node at the root. */
      p = e;
      e = c;
      while (e != -1) {
        c = links[e];
        links[e] = p;
        e = c;
      }
      return p;
    }
    UNIX> make
    g++ -c -O djex1.cpp
    g++ -c -O DJsize.cpp
    g++ -O -o djex1size djex1.o DJsize.o
    g++ -c -O DJheight.cpp
    g++ -O -o djex1height djex1.o DJheight.o
    g++ -c -O DJrank.cpp
    g++ -O -o djex1rank djex1.o DJrank.o
    UNIX>

We first run it with union-by-size. Let's look at the output incrementally. When the program starts, it sets up an empty Disjoint with ten elements:

    UNIX> djex1size
    Starting State:

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    Ranks:  1  1  1  1  1  1  1  1  1  1
Next, it performs three union operations: Union(0, 1), Union(2, 3), and Union(4, 5). Since each set in all three operations is the same size, the choice of parent and child is arbitrary. Here's the output and how it looks pictorially (I've added the sizes to the roots of each set):

    Doing d.Union(0, 1).  Resulting set = 1
    Doing d.Union(2, 3).  Resulting set = 3
    Doing d.Union(4, 5).  Resulting set = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1 -1  3 -1  5 -1 -1 -1 -1 -1
    Ranks:  1  2  1  2  1  2  1  1  1  1
Next it performs four more union operations: Union(1, 3), Union(5, 6), Union(5, 7), and Union(5, 8). The first union operation merges two sets of the same size, so the parent/child selection is arbitrary. The remaining three union operations merge sets of size 1 (sets 6, 7 and 8) with set 5 which is larger. Thus, in each case, set 5 becomes the parent. The resulting sets are pictured to the right.
The Find() operations return the root of each set: three in the set {0, 1, 2, 3}, and five in the set {4, 5, 6, 7, 8}.
You should make sure that you understand how the output of the program maps to the picture. In particular, make sure you understand the Links and Ranks lines and what they mean.
    Doing d.Union(1, 3).  Resulting set = 3
    Doing d.Union(5, 6).  Resulting set = 5
    Doing d.Union(5, 7).  Resulting set = 5
    Doing d.Union(5, 8).  Resulting set = 5

    d.Find(1) = 3
    d.Find(2) = 3
    d.Find(4) = 5
    d.Find(7) = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5 -1  5  5  5 -1
    Ranks:  1  2  1  4  1  5  1  1  1  1
Now, we perform Union(3, 5). Since set 5 has more elements than set 3, it is the parent and 3 is the child. Subsequent Find() operations on 3, 5, 7 and 0 all return 5 as the set id:
    Doing d.Union(3, 5).  Resulting set = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3  5  5 -1  5  5  5 -1
    Ranks:  1  2  1  4  1  9  1  1  1  1

    d.Find(3) = 5
    d.Find(5) = 5
    d.Find(7) = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3  5  5 -1  5  5  5 -1
    Ranks:  1  2  1  4  1  9  1  1  1  1

    d.Find(0) = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3  5  5 -1  5  5  5 -1
    Ranks:  1  2  1  4  1  9  1  1  1  1
    UNIX>

    UNIX> djex1height
    Starting State:

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    Ranks:  1  1  1  1  1  1  1  1  1  1

    Doing d.Union(0, 1).  Resulting set = 1
    Doing d.Union(2, 3).  Resulting set = 3
    Doing d.Union(4, 5).  Resulting set = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1 -1  3 -1  5 -1 -1 -1 -1 -1
    Ranks:  1  2  1  2  1  2  1  1  1  1

    Doing d.Union(1, 3).  Resulting set = 3
    Doing d.Union(5, 6).  Resulting set = 5
    Doing d.Union(5, 7).  Resulting set = 5
    Doing d.Union(5, 8).  Resulting set = 5

    d.Find(1) = 3
    d.Find(2) = 3
    d.Find(4) = 5
    d.Find(7) = 5

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5 -1  5  5  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1
Although the trees look the same, the ranks fields are different, now holding heights rather than sizes. So, when we perform the last union of 3 and 5, 3 becomes the parent, since it has greater height. Subsequent Find() operations all return 3 now:
    Doing d.Union(3, 5).  Resulting set = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5  3  5  5  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1

    d.Find(3) = 3
    d.Find(5) = 3
    d.Find(7) = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5  3  5  5  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1

    d.Find(0) = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5  3  5  5  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1
    UNIX>

    UNIX> djex1rank
    ....
    ....
    Doing d.Union(3, 5).  Resulting set = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5  3  5  5  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1
When we perform the three Find() operations, the last one, Find(7), performs path compression, setting node 7's link to the root of the set, 3:

    d.Find(3) = 3
    d.Find(5) = 3
    d.Find(7) = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  1  3  3 -1  5  3  5  3  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1
Similarly, the last Find(0) operation also performs path compression:
    d.Find(0) = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  3  3  3 -1  5  3  5  3  5 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1

Were we to call Find(4), Find(6) and Find(8), then those calls too would perform path compression, and nodes 4, 6 and 8 would point directly to node three. In that case, the state would be the following:

    d.Find(4) = 3
    d.Find(6) = 3
    d.Find(8) = 3

    Elts:   0  1  2  3  4  5  6  7  8  9
    Links:  3  3  3 -1  3  3  3  3  3 -1
    Ranks:  1  2  1  3  1  2  1  1  1  1
I draw this picture because you should see that ranks[3] remains at three, even though the tree's height is now two. This is because the ranks field tracks what the height of the tree would be with no path compression. We can't keep it updated properly without adding to the running time of the Union() or Find() operations. Fortunately, it doesn't matter: the fine theoreticians of the world have proved that Find() operations run in O(α(n)) amortized time, where α is the inverse Ackermann function. Union() operations are still O(1).
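The gap between rank and real height can be checked directly. The sketch below is an illustration only: actual_height is a hypothetical helper that is not part of the Disjoint class, and the two vectors are copied from the djex1rank output (the state right after Union(3, 5), and the state after every node has been compressed). It walks from each element up to its root and keeps the longest path that ends at the root of interest.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: actual height, counted in nodes, of the tree rooted
// at `root`.  As in the notes, links[e] == -1 marks a root.
int actual_height(const std::vector<int> &links, int root)
{
  assert(root >= 0 && root < (int) links.size());

  int height = 0;
  for (int e = 0; e < (int) links.size(); e++) {
    int depth = 1;                 // Count e itself.
    int cur = e;
    while (links[cur] != -1) {     // Walk up to e's root, counting nodes.
      cur = links[cur];
      depth++;
    }
    if (cur == root) height = std::max(height, depth);
  }
  return height;
}

// Links right after Union(3, 5), before any path compression.
std::vector<int> links_before_compression()
{
  const int v[10] = { 1, 3, 3, -1, 5, 3, 5, 5, 5, -1 };
  return std::vector<int>(v, v + 10);
}

// Links after Find() has been called on every node of set 3.
std::vector<int> links_after_compression()
{
  const int v[10] = { 3, 3, 3, -1, 3, 3, 3, 3, 3, -1 };
  return std::vector<int>(v, v + 10);
}
```

For the first vector the tree rooted at 3 has height three, matching ranks[3]. For the second it has height two, while ranks[3] is still three. This helper costs O(n) per node visited, which is exactly the kind of extra work Union() and Find() avoid by letting the rank go stale.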
A good maze is one where the graph is fully connected, so that every cell is reachable from the start/end cells, but there are no cycles. We can generate such a maze using disjoint sets. We start with a completely disconnected graph, where each cell is surrounded by walls. If this graph has r rows and c columns, then the graph contains r*c nodes and no edges.
What we'll do is choose a random wall to remove. If that wall separates nodes in different connected components, then we'll remove it, thereby lowering the number of connected components. If it doesn't separate nodes in different connected components, we keep it.
This can be done with disjoint sets. We start with each cell in its own set, and then we choose a random wall. If that wall separates two cells in different sets, we remove the wall and call Union() on the two sets. Otherwise, we keep the wall. We keep doing this until we have just one set.
The code is in mazegen.cpp. It's a little tricky. We first generate all the walls. Walls that separate vertically adjacent cells are indexed by the smaller cell number. Walls that separate horizontally adjacent cells are indexed by the smaller cell number plus r*c. We generate all the walls and insert them into a multimap keyed by a random number. Then we traverse the multimap, deleting walls if they separate different components, until we have just one component. Then we print out the walls that remain:
    #include <vector>
    #include <cstdio>
    #include <cstdlib>
    #include <map>
    #include "DJ.h"
    #include <iostream>
    using namespace std;

    typedef multimap <double, int> DIMap;
    typedef DIMap::iterator DIMit;

    int main(int argc, char **argv)
    {
      int r, c, row, column, c1, c2, ncomp, s1, s2, hov;
      Disjoint *d;
      DIMap walls;
      DIMit wit;
      DIMit tmp;

      if (argc != 3) { fprintf(stderr, "Bad dog\n"); exit(1); }

      r = atoi(argv[1]);
      c = atoi(argv[2]);
      d = new Disjoint(r*c);

      for (row = 0; row < r-1; row++) {      // Generate walls that separate vertical cells.
        for (column = 0; column < c; column++) {
          c1 = row*c + column;
          walls.insert(make_pair(drand48(), c1));
        }
      }

      for (row = 0; row < r; row++) {        // Generate walls that separate horizontal cells.
        for (column = 0; column < c-1; column++) {
          c1 = (row*c + column) + r*c;
          walls.insert(make_pair(drand48(), c1));
        }
      }

      ncomp = r*c;

      wit = walls.begin();
      while (ncomp > 1) {
        c1 = wit->second;
        if (c1 < r*c) {                      // This is a wall separating vertical cells
          c2 = c1 + c;
        } else {                             // This is a wall separating horizontal cells
          c1 -= r*c;
          c2 = c1+1;
        }
        s1 = d->Find(c1);
        s2 = d->Find(c2);
        if (s1 != s2) {                      // Test for different connected components.
          d->Union(s1, s2);
          tmp = wit;
          wit++;
          walls.erase(tmp);
          ncomp--;
        } else {
          wit++;
        }
      }

      printf("ROWS %d COLS %d\n", r, c);
      for (wit = walls.begin(); wit != walls.end(); wit++) {
        c1 = wit->second;
        if (c1 < r*c) {
          c2 = c1 + c;
        } else {
          c1 -= r*c;
          c2 = c1+1;
        }
        printf("WALL %d %d\n", c1, c2);
      }
      return 0;
    }
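The wall-numbering scheme is the tricky part, so it may help to look at it on its own. The function below is a hypothetical helper, not something that appears in mazegen.cpp; it only restates the decoding that the program performs inline. It assumes cells are numbered row*c + column, as in the program.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: maps a wall index to the two cells it separates in a
// grid with r rows and c columns.
//   wall <  r*c : separates vertically adjacent cells wall and wall + c.
//   wall >= r*c : separates horizontally adjacent cells (wall - r*c) and
//                 (wall - r*c) + 1.
std::pair<int, int> wall_cells(int wall, int r, int c)
{
  assert(wall >= 0 && wall < 2*r*c);

  if (wall < r*c) return std::make_pair(wall, wall + c);

  int cell = wall - r*c;
  return std::make_pair(cell, cell + 1);
}
```

For example, in a grid with 3 rows and 4 columns, wall 5 separates cell 5 from cell 9 (the cell directly below it), while wall 17, which is 12 + 5, separates cell 5 from cell 6 (the cell to its right).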
We can run this and pipe the output to the program maze_ppm (from a CS302 lab that you may not have done yet), and that lets us generate mazes of all sizes:
    UNIX> mazegensize 50 100 | maze_ppm 5 | convert - maze2.jpg

    #include <vector>
    #include <cstdio>
    #include <cstdlib>
    #include "DJ.h"
    #include <iostream>
    using namespace std;

    int main()
    {
      Disjoint d(8);
      int s01, s23, s45, s67;

      s01 = d.Union(0, 1);
      s23 = d.Union(2, 3);
      s45 = d.Union(4, 5);
      s67 = d.Union(6, 7);
      s01 = d.Union(s01, s23);
      s45 = d.Union(s45, s67);
      s01 = d.Union(s01, s45);

      d.Print();
      printf("\n");
      d.Find(0);
      d.Print();
      exit(0);
    }
When I compile this with DJrank.cpp and run it, the first lines are:
    UNIX> exampleexam

    Elts:   0  1  2  3  4  5  6  7
    Links:  1  3  3  7  5  7  7 -1
    Ranks:  1  2  1  3  1  2  1  4

Draw the data structure (as circles and pointers) just before the d.Find() call. Then give me the output of the last d.Print() call.
When you call d.Find(0), path compression occurs, which means that nodes 0, 1 (and 3) all point to the root (7):
Thus, the links fields for 0 and 1 will become 7. Everything else remains the same, because Find() doesn't change the ranks. So the output is:
    Elts:   0  1  2  3  4  5  6  7
    Links:  7  7  3  7  5  7  7 -1
    Ranks:  1  2  1  3  1  2  1  4
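If you want to check an answer like this one by machine, a small stand-alone sketch will do. The struct RankSets and the function exam_links_after_find are hypothetical names invented for this check; the code mirrors union by rank with the recursive form of path compression, but it is not the DJrank.cpp file itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-alone version of union by rank with path compression.
struct RankSets {
  std::vector<int> links;   // links[e] == -1 marks a root.
  std::vector<int> ranks;   // For a root: height the tree would have without compression.

  explicit RankSets(int n) : links(n, -1), ranks(n, 1) {}

  int Find(int e) {
    if (links[e] == -1) return e;
    links[e] = Find(links[e]);         // Path compression on the way back.
    return links[e];
  }

  int Union(int s1, int s2) {
    assert(links[s1] == -1 && links[s2] == -1);   // Both must be set ids.
    assert(s1 != s2);
    int p, c;
    if (ranks[s1] > ranks[s2]) { p = s1; c = s2; } else { p = s2; c = s1; }
    links[c] = p;
    if (ranks[p] == ranks[c]) ranks[p]++;
    return p;
  }
};

// Replays the exam program and returns the links vector after d.Find(0).
std::vector<int> exam_links_after_find()
{
  RankSets d(8);
  int s01 = d.Union(0, 1);
  int s23 = d.Union(2, 3);
  int s45 = d.Union(4, 5);
  int s67 = d.Union(6, 7);
  s01 = d.Union(s01, s23);
  s45 = d.Union(s45, s67);
  s01 = d.Union(s01, s45);
  d.Find(0);
  return d.links;
}
```

The vector it returns matches the Links line of the answer: nodes 0, 1 and 3 point at 7, node 2 still points at 3, and node 4 still points at 5, because neither 2 nor 4 was on the path from 0 to the root.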