CS302 Lecture Notes - Graph Depth-First Search


Topcoder problems with DFS (as extra practice, via Dr. Plank)


Depth First Search

Depth-First Search (abbreviated "DFS") is one of the most basic graph algorithms. With DFS, each node has a boolean value called visited. Before the DFS, that value is set to false, for all nodes. Then, you commence a DFS on a node. Let's call that node n. The DFS works as follows: When you call DFS on a node n, the DFS will visit every node connected to n. For that reason, DFS is good for activities that involve connectivity. We'll see one of those in the next section.

Depth First Search To Count Connected Components

Two nodes are in the same connected component if there is a path between them. Thus, a graph may be partitioned into its connected components. To discover all the nodes connected to a given node, you can perform a depth first search on that node. Thus, if you want to identify all of the connected components of a graph, you can do that with one DFS for each component of the graph.

This results in the following algorithm for determining connected components. First, you read in a graph. Then you set all visited fields to zero. Then you traverse all the nodes in the graph, and whenever you encounter one whose visited field is zero, you perform the connected component depth first search on it. The total number of depth first searches is equal to the number of connected components in the graph.

The code is in concomp.cpp. (Dr. Plank has a second implementation that has pointers in the adjacency lists rather than integers, because sometimes he uses that. If it helps look at this alternative file to see that program, and an explanation of how it differs.)

In this program, we have separate classes for graphs and nodes. Graphs contain a vector of nodes (pointers to nodes, really). Each node has an adjacency list called edges, and this is a list of node numbers, as described above. The DFS is called Component_Count(), and it takes two parameters -- the node on which to perform the DFS (n), and the component number. The DFS uses a field component in each node as its visited field; however, instead of setting it to false initially, it is set to -1. And instead of setting it to true during the DFS, it is set to the component number.

The code is below. It is pretty straightforward:

#include <iostream>
#include <string>
#include <vector>
#include <stdlib.h>
using namespace std;

class Node {
  public:
    int id;
    vector <int> edges;
    int component;
};

class Graph {
  public:
    vector <Node *> nodes; // Note, this vector is of *pointers* to nodes.
    void Print();
    void Component_Count(int index, int cn);
};

// This is the DFS, where the component variable is used as the visited field.

void Graph::Component_Count(int index, int cn)
{
  Node *n;
  int i;

  n = nodes[index];
  if (n->component != -1) return;
  n->component = cn;
  for (i = 0; i < n->edges.size(); i++) Component_Count(n->edges[i], cn);
}

// The Print() function prints each node, its component number and its adjacency list.

void Graph::Print()
{
  int i, j;
  Node *n;

  for (i = 0; i < nodes.size(); i++) {
    n = nodes[i];
    cout << "Node " << i << ": " << n->component << ":";
    for (j = 0; j < n->edges.size(); j++) {
      cout << " " << n->edges[j];
    }
    cout << endl;
  }
}


// The main() function reads in a graph, does the connected component
// determination, and then prints the graph.

main(int argc, char **argv)
{
  Graph g;
  string s;
  int nn, n1, n2, i, c;
  Node *n;

  cin >> s;
  if (s != "NNODES") { cerr << "Bad graph\n"; exit(1); }
  cin >> nn;

  // Create all of the nodes.

  for (i = 0; i < nn; i++) {
    n = new Node;
    n->component = -1;
    n->id = i;
    g.nodes.push_back(n);
  }

  // Read all of the edges.  There is no error checking here,
  // so the input has to be correct or this program will not work.

  while (!cin.fail()) {
    cin >> s >> n1 >> n2;
    if (!cin.fail()) {
      if (s != "EDGE") { cerr << "Bad graph\n"; exit(1); }
      g.nodes[n1]->edges.push_back(n2);
      g.nodes[n2]->edges.push_back(n1);
    }
  }

  // Traverse the nodes.  If you find one whose component number has
  // not yet been determined, run the DFS on it.

  c = 0;
  for (i = 0; i < g.nodes.size(); i++) {
    if (g.nodes[i]->component == -1) {
      c++;
      g.Component_Count(i, c);
     }
  }

  g.Print();
}

As we can see, it works fine on our two example files. Pay attention to the output. Each line prints a node, its connected component number, and its adjacency list. Make sure you understand the output and how it relates to the pictures.

UNIX> concomp < g1.txt
Node 0: 1:
Node 1: 2: 3
Node 2: 3:
Node 3: 2: 5 1
Node 4: 4: 9 6 7
Node 5: 2: 3
Node 6: 4: 4 8
Node 7: 4: 4
Node 8: 4: 6
Node 9: 4: 4
UNIX> concomp < g2.txt
Node 0: 1: 3
Node 1: 1: 2
Node 2: 1: 1 7 9
Node 3: 1: 7 0
Node 4: 2:
Node 5: 1: 9 8 7
Node 6: 1: 8
Node 7: 1: 3 2 5
Node 8: 1: 5 6
Node 9: 1: 5 2
UNIX> 
The first call identifies the connected components as: It's not a bad idea to copy this file over and put some print statements in so that you can visualize the depth first search.

What's the running time? O(|V| + |E|). This covers two cases:

Thus, we say that counting connected components is linear in the number of vertices and edges. (We learned about connected components when we learned about disjoint sets too. We can easily count connected components with disjoint sets, and the running time is O(|E| α(|V|)). So, why would we ever use disjoint sets? The answer is that sometimes the problem asks you to do the connected component identification incrementally, and for that, disjoint sets are far superior to depth-first search. Think about that.)

Depth First Search To Perform Cycle Detection

Cycle detection is another depth first search. Here we also set a visited field; however, if we now encounter a node whose visited field is set, we know that the node is part of a cycle, and we return that fact. Again, it's a simple search, and I put the relevant code below (in cycledet0.cpp):

class Graph {
  public:
    vector <Node *> nodes;
    void Print();
    int is_cycle(int index);
};

int Graph::is_cycle(int index)
{
  Node *n;
  int i;

  n = nodes[index];
  if (n->visited) return 1;
  n->visited = 1;
  for (i = 0; i < n->edges.size(); i++) {
    if (is_cycle(n->edges[i])) return 1;
  }
  return 0;
}

main(int argc, char **argv)
{
  ...

  for (i = 0; i < g.nodes.size(); i++) {
    if (!g.nodes[i]->visited) {
      if (g.is_cycle(i)) {
        cout << "There is a cycle reachable from node " << i << endl;
      } else {
        cout << "No cycle reachable from node " << i << endl;
      }
    }
  }
}

Note that unlike connected components, this procedure has a return value, and it uses that return value to truncate the search when a cycle is found.

When we run it, we see that it doesn't work correctly, as it says that g1 has a bunch of cycles, when we know that it doesn't:

UNIX> cycledet0 < g1.txt
No cycle reachable from node 0
There is a cycle reachable from node 1
No cycle reachable from node 2
There is a cycle reachable from node 4
There is a cycle reachable from node 6
There is a cycle reachable from node 7
There is a cycle reachable from node 8
UNIX> 
Hmmm -- in cycledet1.cpp I put a print statement at the beginning of is_cycle():
UNIX> cycledet1 < g1.txt
Called is_cycle(0)
No cycle reachable from node 0
Called is_cycle(1)
Called is_cycle(3)
Called is_cycle(5)
Called is_cycle(3)
There is a cycle reachable from node 1
...
There's the bug. The program first visits node 0 and finds no cycle. Then it visits node 1 and recursively visits nodes 3 and 5. Since node 5 has an edge back to node 3, it detects a cycle there. How do we fix this bug?

One simple way is to include who calls is_cycle() as a parameter so that is_cycle() will not detect cycles that include the same edge twice. Here's the changed procedure and call from main() in cycledet2.cpp

int Graph::is_cycle(int index, int from)
{
  Node *n;
  int i;

  n = nodes[index];
  if (n->visited) return 1;
  n->visited = 1;
  for (i = 0; i < n->edges.size(); i++) {
    if (n->edges[i] != from) {
      if (is_cycle(n->edges[i], index)) return 1;
    }
  }
  return 0;
}

main(int argc, char **argv)
{
  ...
  for (i = 0; i < g.nodes.size(); i++) {
    if (!g.nodes[i]->visited) {
      if (g.is_cycle(i, -1)) {
        cout << "There is a cycle reachable from node " << i << endl;
      } else {
        cout << "No cycle reachable from node " << i << endl;
      }

    }
  }
}

All works well now:

UNIX> cycledet2 < g1.txt
No cycle reachable from node 0
No cycle reachable from node 1
No cycle reachable from node 2
No cycle reachable from node 4
UNIX> cycledet2 < g2.txt
There is a cycle reachable from node 0
No cycle reachable from node 4
UNIX> 
If you want to print the cycle, then you can start from when you first detect the cycle, and then stop when you reach the node from whence you detected the cycle. That's in cycledet3.cpp. Note, when I detect the cycle, I set the visited field to two. That is how I know when to stop printing and exit the program:

int Graph::is_cycle(int index, int from)
{
  Node *n;
  int i;
  int rv;

  n = nodes[index];
  if (n->visited) {
    n->visited = 2;
    cout << "Cycle: " << index;
    return 1;
  }
  n->visited = 1;
  for (i = 0; i < n->edges.size(); i++) {
    if (n->edges[i] != from) {
      if (is_cycle(n->edges[i], index)) {
        cout << " " << index;
        if (n->visited == 2) {
          cout << endl;
          exit(1);
        }
        return 1;
      }
    }
  }
  return 0;
}

UNIX> cycledet3 < g2.txt
Cycle: 7 5 9 2 7
UNIX>