CS494 Lecture Notes - PageRank


Reference Material

The main reference material for PageRank is a wonderful column written by David Austin for the American Mathematical Society: http://www.ams.org/samplings/feature-column/fcarc-pagerank. My in-class lecture follows this article.

Implementations

As with many of the algorithms in this class, I don't feel it's fruitful to go through all of the esoterica of a complete implementation. However, I do think it's useful to go through the highlights, and then use the implementation to explore a little.

So.

My first implementation is in Pagerank-1-Read-Graph.cpp. All this does is read a graph and print it out. The format of our graphs is really simple -- we specify edges as

From -> To

I have some example graphs in this directory. The graph G1.txt has the main graph from the web writeup cited above:

# This is the initial graph from http://www.ams.org/samplings/feature-column/fcarc-pagerank

1 -> 2
1 -> 3
2 -> 4

3 -> 5
3 -> 2

4 -> 5
4 -> 6
4 -> 2

5 -> 7
5 -> 8
5 -> 6

6 -> 8

7 -> 1
7 -> 5
7 -> 8

8 -> 6
8 -> 7

When we run it, it simply prints out the graph, having created an adjacency list to store each node. It works by having the node names be strings, so that you can have flexible graphs. For example:

UNIX> cat G5.txt
John -> Paul
John -> George
Paul -> Ringo
George -> Ringo
Ringo -> John
UNIX> Pagerank-1-Read-Graph < G5.txt
Node John:
  Edge to: Paul
  Edge to: George
Node Paul:
  Edge to: Ringo
Node George:
  Edge to: Ringo
Node Ringo:
  Edge to: John
UNIX> 
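The reading step is simple enough to sketch. Here's a minimal version of it -- the function name read_graph and the map-of-string-to-vector representation are my own choices, not necessarily what Pagerank-1-Read-Graph.cpp does internally, and this sketch assumes node names are single tokens (which isn't true of the bridge graphs later in these notes):

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

/* Read edges of the form "From -> To" into an adjacency list keyed on
   node name.  Lines starting with '#' are treated as comments. */
map<string, vector<string>> read_graph(istream &in)
{
  map<string, vector<string>> adj;
  string from, arrow, to;

  while (in >> from) {
    if (from[0] == '#') { getline(in, arrow); continue; } /* Skip the comment. */
    in >> arrow >> to;                                    /* arrow is "->". */
    adj[from].push_back(to);
    adj[to];                       /* Create the target node if it's new. */
  }
  return adj;
}
```

Feeding G5.txt through this and printing each node's edge list would reproduce a listing like the one above, though the map orders nodes alphabetically rather than in the order they were first seen.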

Turning the Adjacency List into a Matrix

The next piece of code, Pagerank-2-H-Only-Dense.cpp, creates the matrix H from the adjacency list. Here, we can make sure we match the web page from the AMS, and we can see the matrix that our John-Paul-George-Ringo graph creates:
UNIX> Pagerank-2-H-Only-Dense < G1.txt
Index    0 - 0.000 0.000 0.000 0.000 0.000 0.000 0.333 0.000 - Node  1
Index    1 - 0.500 0.000 0.500 0.333 0.000 0.000 0.000 0.000 - Node  2
Index    2 - 0.500 0.000 0.000 0.000 0.000 0.000 0.000 0.000 - Node  3
Index    3 - 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 - Node  4
Index    4 - 0.000 0.000 0.500 0.333 0.000 0.000 0.333 0.000 - Node  5
Index    5 - 0.000 0.000 0.000 0.333 0.333 0.000 0.000 0.500 - Node  6
Index    6 - 0.000 0.000 0.000 0.000 0.333 0.000 0.000 0.500 - Node  7
Index    7 - 0.000 0.000 0.000 0.000 0.333 1.000 0.333 0.000 - Node  8
UNIX> Pagerank-2-H-Only-Dense < G5.txt 
Index    0 - 0.000 0.000 0.000 1.000 - Node  John
Index    1 - 0.500 0.000 0.000 0.000 - Node  Paul
Index    2 - 0.500 0.000 0.000 0.000 - Node  George
Index    3 - 0.000 1.000 1.000 0.000 - Node  Ringo
UNIX> 
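Building H from the adjacency list is just a matter of filling in column j with 1/outdegree(j) for each of node j's targets. Here's a sketch of that step -- build_H and the alphabetical index ordering are my own assumptions; the actual program evidently indexes nodes in the order it first sees them, which is why John is index 0 in its output above:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>
using namespace std;

/* Build the dense, column-stochastic matrix H from an adjacency list:
   if node j has k outgoing edges, then column j holds 1/k in each row i
   such that there is an edge j -> i, and zero everywhere else. */
vector<vector<double>> build_H(const map<string, vector<string>> &adj)
{
  map<string, int> index;                 /* Node name -> matrix index. */
  for (auto &n : adj) { int k = index.size(); index[n.first] = k; }

  vector<vector<double>> H(adj.size(), vector<double>(adj.size(), 0.0));
  for (auto &n : adj) {
    int j = index[n.first];
    for (auto &to : n.second) H[index[to]][j] = 1.0 / n.second.size();
  }
  return H;
}
```

Note that each column sums to one (unless the node has no outgoing edges at all, which we deal with later).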

Creating the Eigenvector I From the Matrix

Our next step is to iteratively create the eigenvector I from the matrix. The code is in Pagerank-3-Find-I-Dense.cpp, and the important method is Find_I(), which I have shown here. It's a pretty straightforward matrix-vector product that we keep applying until successive vectors differ by less than a threshold:

void PageRank::Find_I()
{
  vector <double> New_I;    
  int i, j, done;
  double p;
  double threshold;

  New_I.resize(H.size());
  I.resize(H.size(), 0);
  I[0] = 1;
  threshold = 0.000000001;

  done = 0;
  while (!done) {
 
    for (i = 0; i < I.size(); i++) {   /* Calculate the new values of I */
      p = 0;
      for (j = 0; j < I.size(); j++) p += H[i][j] * I[j];
      New_I[i] = p;
    }
                                            /* See if the new and old values */
    done = 1;                               /* are close enough to quit. */
    for (i = 0; done && i < I.size(); i++) {
      if (I[i] - New_I[i] > threshold || New_I[i] - I[i] > threshold) done = 0;
    }
    I = New_I;           /* Yes, I have been seduced by evil in C++ */
  }
}

At this point, we can double-check ourselves that our calculations match those of the AMS web site:

UNIX> Pagerank-3-Find-I-Dense < G1.txt
0.06000000 1
0.06750000 2
0.03000000 3
0.06750000 4
0.09750000 5
0.20250000 6
0.18000000 7
0.29500000 8
UNIX> 
Nice.
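Another way to double-check is to verify that the printed vector really is a fixed point of H -- that multiplying H by I gives I back. Here's a small sketch that does that for the G1 matrix shown earlier (mat_vec and G1_fixed_point_checks are names of my own invention):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

/* Multiply the matrix H by the vector I. */
vector<double> mat_vec(const vector<vector<double>> &H, const vector<double> &I)
{
  vector<double> out(H.size(), 0.0);
  for (size_t i = 0; i < H.size(); i++)
    for (size_t j = 0; j < H[i].size(); j++) out[i] += H[i][j] * I[j];
  return out;
}

/* Check that H * I = I (to within a tolerance) for the G1 matrix and the
   vector that Pagerank-3-Find-I-Dense printed for it. */
bool G1_fixed_point_checks()
{
  double t = 1.0 / 3.0;
  vector<vector<double>> H = {
    {  0,  0,  0,  0,  0,  0,  t,  0 },
    { .5,  0, .5,  t,  0,  0,  0,  0 },
    { .5,  0,  0,  0,  0,  0,  0,  0 },
    {  0,  1,  0,  0,  0,  0,  0,  0 },
    {  0,  0, .5,  t,  0,  0,  t,  0 },
    {  0,  0,  0,  t,  t,  0,  0, .5 },
    {  0,  0,  0,  0,  t,  0,  0, .5 },
    {  0,  0,  0,  0,  t,  1,  t,  0 } };
  vector<double> I = { .06, .0675, .03, .0675, .0975, .2025, .18, .295 };

  vector<double> HI = mat_vec(H, I);
  for (size_t i = 0; i < I.size(); i++) {
    if (fabs(HI[i] - I[i]) > 1e-9) return false;
  }
  return true;
}
```

The check passes: the values above are exactly the stationary probabilities for G1.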

Using a Sparse Matrix Representation

One of the important components to PageRank is the fact that it can run fast on billions of web sites. The key realization here is that each web site has an average of, say, 10 links out of it, which means that the H matrix has an enormous number of zeros. So, we can leverage a sparse representation of the matrix, where each row is represented by two vectors: one holding the column indices of the row's nonzero entries, and one holding the values at those indices. When the H matrix contains a lot of zeros, this representation saves a lot in terms of both time and space. We use this representation in Pagerank-4-Find-I-Sparse.cpp. Here's the new code for Find_I():

void PageRank::Find_I()
{
  vector <double> New_I;    
  int i, j, done, row;
  double p;
  double threshold;

  New_I.resize(H_Ind.size());
  I.resize(H_Ind.size(), 0);
  I[0] = 1;
  threshold = 0.000000001;

  done = 0;
  while (!done) {
 
    for (i = 0; i < I.size(); i++) {   /* Calculate the new values of I */
      p = 0;
      for (j = 0; j < H_Ind[i].size(); j++) {
        row = H_Ind[i][j];
        p += H_Val[i][j] * I[row];
      }
      New_I[i] = p;
    }
                                            /* See if the new and old values */
    done = 1;                               /* are close enough to quit. */
    for (i = 0; done && i < I.size(); i++) {
      if (I[i] - New_I[i] > threshold || New_I[i] - I[i] > threshold) done = 0;
    }
    I = New_I;           /* Yes, I have been seduced by evil in C++ */
  }
}
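Building the two sparse vectors from the adjacency list is straightforward. Here's a sketch of that step -- build_sparse_H and the alphabetical index ordering are my own assumptions; the lecture code constructs H_Ind and H_Val in its own way:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>
using namespace std;

/* Build the sparse representation of H: for each row i, H_Ind[i] holds
   the column indices of the row's nonzero entries, and H_Val[i] holds
   the values at those indices. */
void build_sparse_H(const map<string, vector<string>> &adj,
                    vector<vector<int>> &H_Ind,
                    vector<vector<double>> &H_Val)
{
  map<string, int> index;                 /* Node name -> matrix index. */
  for (auto &n : adj) { int k = index.size(); index[n.first] = k; }

  H_Ind.assign(adj.size(), vector<int>());
  H_Val.assign(adj.size(), vector<double>());
  for (auto &n : adj) {
    int j = index[n.first];
    for (auto &to : n.second) {
      H_Ind[index[to]].push_back(j);               /* Column index. */
      H_Val[index[to]].push_back(1.0 / n.second.size());  /* Its value. */
    }
  }
}
```

Now each matrix-vector product costs time proportional to the number of edges rather than the square of the number of nodes.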


The next step - dealing with nodes that don't have outgoing edges

As described in the writeup above, if a node does not have any outgoing edges, we say that the random surfer in our PageRank model simply visits a new page at random. So, in our calculation of I, we need to sum up the I values of all nodes that have no outgoing edges, and then divide that sum by the number of nodes. That is the starting value for each entry of New_I, corresponding to the cases where we jump randomly to a node because the node that we're currently visiting has no outgoing links.

The code is in Pagerank-5-Use-Zero-Cols.cpp. Here's the new code, which is in Find_I():

    p = 0;
    for (i = 0; i < Adj.size(); i++) {
      if (Adj[i].size() == 0) p += I[i];
    }
    p /= (double) I.size();

    New_I.clear();
    New_I.resize(I.size(), p);

When we run it on the simple graph in G3.txt, we get the expected output. Here's the graph.

1 -> 2

And here's the output, which matches expectations.

UNIX> Pagerank-5-Use-Zero-Cols < G3.txt
0.33333333 1
0.66666667 2
UNIX> 
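To see the whole correction in one place, here's a self-contained sketch of the iteration with the dangling-node fix. The name find_I_with_dangling is my own, and it works on integer node indices rather than the string names the real code uses:

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

/* Power iteration with the dangling-node correction.  Adj[j] holds the
   indices of the nodes that node j links to; nodes with no outgoing
   edges spread their probability uniformly over all N nodes. */
vector<double> find_I_with_dangling(const vector<vector<int>> &Adj)
{
  size_t N = Adj.size();
  vector<double> I(N, 0.0), New_I(N);
  I[0] = 1;

  while (1) {
    double p = 0;                 /* Probability mass on dangling nodes... */
    for (size_t j = 0; j < N; j++) if (Adj[j].empty()) p += I[j];
    New_I.assign(N, p / N);       /* ...spread evenly as the starting point. */

    for (size_t j = 0; j < N; j++)        /* The usual H * I product. */
      for (int to : Adj[j]) New_I[to] += I[j] / Adj[j].size();

    double diff = 0;              /* Quit when old and new are close. */
    for (size_t i = 0; i < N; i++) diff += fabs(New_I[i] - I[i]);
    I = New_I;
    if (diff < 1e-9) break;
  }
  return I;
}
```

On G3 (the single edge 1 -> 2), this settles on 1/3 for node 1 and 2/3 for node 2, matching the output above.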

The Final Piece of the Puzzle - When We Simply Jump To A Random Link

The last part of PageRank is introducing a factor α, which specifies how often we follow links; the rest of the time, with probability (1-α), we simply jump to a random site. The final program is in Pagerank-6-Do-Alpha.cpp, and here's the relevant code:

    p = (1 - alpha) / (double) I.size();    /* Scale by alpha */
    for (i = 0; i < I.size(); i++) {
      New_I[i] = (alpha * New_I[i]) + p;
    }
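Putting all of the pieces together, here's a complete sketch of the iteration with both the dangling-node fix and α. Again, the name pagerank and the integer-index representation are my own simplifications of the lecture code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>
using namespace std;

/* A complete PageRank sketch: the dangling-node correction plus alpha.
   With probability alpha we follow a link; with probability (1 - alpha)
   we jump to a node chosen uniformly at random. */
vector<double> pagerank(const vector<vector<int>> &Adj, double alpha)
{
  size_t N = Adj.size();
  vector<double> I(N, 1.0 / N), New_I(N);

  while (1) {
    double p = 0;                          /* Dangling-node mass. */
    for (size_t j = 0; j < N; j++) if (Adj[j].empty()) p += I[j];
    New_I.assign(N, p / N);

    for (size_t j = 0; j < N; j++)         /* The H * I contribution. */
      for (int to : Adj[j]) New_I[to] += I[j] / Adj[j].size();

    for (size_t i = 0; i < N; i++)         /* Scale by alpha. */
      New_I[i] = alpha * New_I[i] + (1 - alpha) / N;

    double diff = 0;
    for (size_t i = 0; i < N; i++) diff += fabs(New_I[i] - I[i]);
    I = New_I;
    if (diff < 1e-12) break;
  }
  return I;
}
```

On the G3 graph with α = 0.85, a little algebra gives node 2's rank as (1+α)/(2+α) ≈ 0.649 and node 1's as 1/(2+α) ≈ 0.351, so the random jumps pull the answer slightly away from the 1/3, 2/3 split.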


An interesting PageRank Calculation: Duplicate Bridge partnerships

After hacking up PageRank, I wanted to show a PageRank calculation that isn't a simple web-page traversal. Here's what I came up with. As you may or may not know, one of my hobbies is Duplicate Bridge, the card game. The Knoxville Duplicate Bridge group's web site is http://www.unit165.org/kabc/, and as you can see from the "Club Schedules" page, you can play bridge on any day of the week in Knoxville or Maryville. The results of these games are posted online, in the "Results" tab, and these files can be a nice source of data.

In bridge, you play with a partner. These are denoted in the beginning of those results files. For example, if you take a look at the results for January 25, 2016, you'll see that I played with my wife, and Dr. Vander Zanden played with David Shepler (and yes, Dr. Vander Zanden scored better than we did - a rare occurrence...).

Some people, like me, play with very few partners. Some, like Mr. Shepler, play with a lot of partners. We can consider partnership as edges on a graph -- people are nodes, and if two people are ever partners, we connect them with an edge. PageRank will then give you a measure of the people who have a lot of partnership coverage. The ACBL (the American Contract Bridge League) could well use PageRank to identify players for marketing purposes. Those with higher PageRank values would be more influential in, for example, testing new products than those with lower PageRank values. So, we are going to generate the graphs for this calculation and then do the calculation.

The data I used was composed of all of the bridge games played in Knoxville and Maryville in December, 2015 through January, 2016. I have put the data into the file players.txt, and to protect people's privacy, I have given everyone random names except for me, my wife, Dr. Vander Zanden, and Mr. Shepler. Every partnership has two lines in the file, one with one player first, and another with the second player first. So:

UNIX> grep Plank players.txt
James Plank -> Susan Plank
Susan Plank -> James Plank
UNIX> grep Vander players.txt
Brad Vander Zanden -> David Shepler
David Shepler -> Brad Vander Zanden
UNIX> grep Shepler players.txt
Kayla Tad -> David Shepler
Nathan Ebony -> David Shepler
Caleb Trudge -> David Shepler
Brad Vander Zanden -> David Shepler
Lauren Olga -> David Shepler
Matthew Prep -> David Shepler
David Shepler -> Kayla Tad
David Shepler -> Nathan Ebony
David Shepler -> Caleb Trudge
David Shepler -> Brad Vander Zanden
David Shepler -> Lauren Olga
David Shepler -> Matthew Prep
David Shepler -> Ava Ewe
David Shepler -> Sophie Shuck
Ava Ewe -> David Shepler
Sophie Shuck -> David Shepler
UNIX> 
As you can see, in December and January, I only partnered with my wife, and she only partnered with me. Dr. Vander Zanden only partnered with Mr. Shepler, while Mr. Shepler had eight different partners in the two months.

My prediction on PageRank was that my wife and I would have the lowest possible PageRank scores. Dr. Vander Zanden's would be higher, and Mr. Shepler's would be a lot higher. I was wrong:

UNIX> Pagerank-6-Do-Alpha < players.txt | sort -nr | head -n 1
0.00745788 Leah Tissue
UNIX> Pagerank-6-Do-Alpha < players.txt | sort -nr | tail -n 1
0.00069401 Brody Ambition
UNIX> Pagerank-6-Do-Alpha < players.txt | egrep 'Plank|Vander|Shepler'
0.00398250 David Shepler
0.00073052 Brad Vander Zanden
0.00204918 James Plank
0.00204918 Susan Plank
UNIX> 
Mr. Shepler's PageRank was indeed a lot higher than ours, but Dr. VZ's was lower. Why? Here's a graph that may help illustrate.

The file E-Bridge-Graph.txt defines this graph, and let's take a look at PageRank on it:

UNIX> Pagerank-6-Do-Alpha < E-Bridge-Graph.txt | sort -nr 
0.13368724 Shepler
0.09090909 Suzy
0.09090909 Dr. P
0.09090909 D
0.09090909 C
0.09090909 B
0.09090909 A
0.08989999 Xavier
0.08972191 Zora
0.08972191 Wanda
0.05151441 Dr. VZ
UNIX> 
Very interesting -- all of the nodes in the cycle have the same PageRank, which is equal to 1/11 (with 11 being the number of nodes). However, when you add the extra node only connected to one node in the cycle (the "Dr. VZ" node), everything gets shaken up. The "Shepler" node's PageRank goes up, which makes sense, because there are three ways into it. The Dr. VZ node's PageRank is the lowest because there's only one way into it, and Wanda/Zora/Xavier's nodes are in the middle. I'm guessing that Xavier's PageRank is higher than Wanda/Zora's because Wanda/Zora only get 1/3 of Shepler's visits, while Xavier gets 1/2 of Wanda/Zora's.

That makes it easier to understand that Dr. VZ's PageRank is lower than mine.

You'll also note that the sum of Wanda/Xavier/Zora/Shepler/Dr.-VZ's PageRanks is 5/11. In fact, if there are n nodes, then the sum of the PageRanks of the nodes in a connected component with c nodes will be c/n.

Interesting stuff, this.