So.
My first implementation is in src/Pagerank-1-Read-Graph.cpp. All this does is read a graph and print it out. The format of our graphs is really simple -- we simply specify edges as
From -> To
I have some example graphs in this directory. The graph G1.txt has the main graph from the web writeup cited above (except I've made it 0-indexed):
# This is the initial graph from http://www.ams.org/samplings/feature-column/fcarc-pagerank
0 -> 1
0 -> 2
1 -> 3
2 -> 4
2 -> 1
3 -> 4
3 -> 5
3 -> 1
4 -> 6
4 -> 7
4 -> 5
5 -> 7
6 -> 0
6 -> 4
6 -> 7
7 -> 5
7 -> 6
When we run it, it simply prints out the graph, having created an adjacency list to store each node. It works by having the node names be strings, so that you can have flexible graphs. For example:
UNIX> cat G5.txt
John -> Paul
John -> George
Paul -> Ringo
George -> Ringo
Ringo -> John
UNIX> bin/Pagerank-1-Read-Graph < G5.txt
Node John:
Edge to: Paul
Edge to: George
Node Paul:
Edge to: Ringo
Node George:
Edge to: Ringo
Node Ringo:
Edge to: John
UNIX>
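Here's a minimal sketch of how a reader for this format might look. It is not the code in src/Pagerank-1-Read-Graph.cpp: the Graph struct and the read_graph(), node_id() and trim() names are made up for illustration, and skipping lines that start with # is an assumption based on the first line of G1.txt. Splitting on the arrow rather than on whitespace lets node names contain spaces.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: node names in first-seen order plus an adjacency list.
struct Graph {
  std::vector<std::string> names;       // index -> node name
  std::map<std::string, int> index;     // node name -> index
  std::vector<std::vector<int>> adj;    // adj[i] = indices of the nodes that i points to
};

// Return the index for a name, creating the node the first time we see it.
static int node_id(Graph &g, const std::string &name) {
  auto it = g.index.find(name);
  if (it != g.index.end()) return it->second;
  int id = (int) g.names.size();
  g.index[name] = id;
  g.names.push_back(name);
  g.adj.emplace_back();
  return id;
}

// Strip leading and trailing blanks, tabs and carriage returns.
static std::string trim(const std::string &s) {
  size_t b = s.find_first_not_of(" \t\r");
  if (b == std::string::npos) return "";
  size_t e = s.find_last_not_of(" \t\r");
  return s.substr(b, e - b + 1);
}

// Read "From -> To" lines. Blank lines, lines starting with '#' (an assumption)
// and lines without an arrow are skipped.
Graph read_graph(std::istream &in) {
  Graph g;
  std::string line;
  while (std::getline(in, line)) {
    line = trim(line);
    if (line.empty() || line[0] == '#') continue;
    size_t arrow = line.find("->");
    if (arrow == std::string::npos) continue;
    std::string from = trim(line.substr(0, arrow));
    std::string to = trim(line.substr(arrow + 2));
    if (from.empty() || to.empty()) continue;
    int f = node_id(g, from);           // "from" is numbered before "to"
    int t = node_id(g, to);
    g.adj[f].push_back(t);
  }
  assert(g.adj.size() == g.names.size());
  return g;
}
```

Calling read_graph(std::cin) and then walking names and adj in index order would give a listing in the same spirit as the output above.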
UNIX> bin/Pagerank-2-H-Only-Dense < G1.txt
Index 0 - 0.000 0.000 0.000 0.000 0.000 0.000 0.333 0.000 - Node 0
Index 1 - 0.500 0.000 0.500 0.333 0.000 0.000 0.000 0.000 - Node 1
Index 2 - 0.500 0.000 0.000 0.000 0.000 0.000 0.000 0.000 - Node 2
Index 3 - 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 - Node 3
Index 4 - 0.000 0.000 0.500 0.333 0.000 0.000 0.333 0.000 - Node 4
Index 5 - 0.000 0.000 0.000 0.333 0.333 0.000 0.000 0.500 - Node 5
Index 6 - 0.000 0.000 0.000 0.000 0.333 0.000 0.000 0.500 - Node 6
Index 7 - 0.000 0.000 0.000 0.000 0.333 1.000 0.333 0.000 - Node 7
UNIX> bin/Pagerank-2-H-Only-Dense < G5.txt
Index 0 - 0.000 0.000 0.000 1.000 - Node John
Index 1 - 0.500 0.000 0.000 0.000 - Node Paul
Index 2 - 0.500 0.000 0.000 0.000 - Node George
Index 3 - 0.000 1.000 1.000 0.000 - Node Ringo
UNIX>
void PageRank::Find_I()
{
  vector <double> New_I;
  int i, j, done;
  double p;
  double threshold;

  New_I.resize(H.size());
  I.resize(H.size(), 0);
  I[0] = 1;
  threshold = 0.000000001;

  done = 0;
  while (!done) {
    for (i = 0; i < I.size(); i++) {       /* Calculate the new values of I */
      p = 0;
      for (j = 0; j < I.size(); j++) p += H[i][j] * I[j];
      New_I[i] = p;
    }

    /* See if the new and old values are close enough to quit. */
    done = 1;
    for (i = 0; done && i < I.size(); i++) {
      if (I[i] - New_I[i] > threshold || New_I[i] - I[i] > threshold) done = 0;
    }
    I = New_I;                   /* Yes, I have been seduced by evil in C++ */
  }
}
At this point, we can double-check that our calculations match those of the AMS web site:
UNIX> bin/Pagerank-3-Find-I-Dense < G1.txt
0.06000000 1
0.06750000 2
0.03000000 3
0.06750000 4
0.09750000 5
0.20250000 6
0.18000000 7
0.29500000 8
UNIX>
Nice.
void PageRank::Find_I()
{
  vector <double> New_I;
  int i, j, done, row;
  double p;
  double threshold;

  New_I.resize(H_Ind.size());
  I.resize(H_Ind.size(), 0);
  I[0] = 1;
  threshold = 0.000000001;

  done = 0;
  while (!done) {
    for (i = 0; i < I.size(); i++) {       /* Calculate the new values of I */
      p = 0;
      for (j = 0; j < H_Ind[i].size(); j++) {
        row = H_Ind[i][j];
        p += H_Val[i][j] * I[row];
      }
      New_I[i] = p;
    }

    /* See if the new and old values are close enough to quit. */
    done = 1;
    for (i = 0; done && i < I.size(); i++) {
      if (I[i] - New_I[i] > threshold || New_I[i] - I[i] > threshold) done = 0;
    }
    I = New_I;                   /* Yes, I have been seduced by evil in C++ */
  }
}
The code is in src/Pagerank-5-Use-Zero-Cols.cpp. Here's the new code, which is in Find_I():
  p = 0;
  for (i = 0; i < Adj.size(); i++) {
    if (Adj[i].size() == 0) p += I[i];
  }
  p /= (double) I.size();
  New_I.clear();
  New_I.resize(I.size(), p);
When we run it on the simple graph in G2.txt, we get the expected output. Here's the graph.
0 -> 1
And here's the output, which matches expectations.
UNIX> bin/Pagerank-5-Use-Zero-Cols < G2.txt
0.33333333 0
0.66666667 1
UNIX>
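To see where those two numbers come from, here's the arithmetic worked by hand. I'm assuming this version has no alpha term, and that node 1 (which has no outgoing edges) hands its value out evenly to both nodes, as the zero-column code does. Node 0 sends everything to node 1, so a fixed point has to satisfy:

```latex
\begin{aligned}
I_0 &= \tfrac{1}{2} I_1 \\
I_1 &= I_0 + \tfrac{1}{2} I_1 \\
I_0 + I_1 &= 1
\end{aligned}
\quad\Longrightarrow\quad
I_0 = \tfrac{1}{3}, \qquad I_1 = \tfrac{2}{3}
```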
  p = (1 - alpha) / (double) I.size();       /* Scale by alpha */
  for (i = 0; i < I.size(); i++) {
    New_I[i] = (alpha * New_I[i]) + p;
  }
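To show how the pieces might fit together, here's a self-contained sketch of one possible full iteration: spread the value of nodes with no outgoing edges, multiply by H straight from the adjacency list, then scale by alpha. It is a stand-alone function rather than the author's PageRank class; the name pagerank(), the max_iters safety cap and the default threshold are assumptions of mine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// adj[from] holds the indices of the nodes that "from" points to.
// Returns the vector I after the values stop changing by more than threshold.
std::vector<double> pagerank(const std::vector<std::vector<int>> &adj, double alpha,
                             double threshold = 1e-9, int max_iters = 100000) {
  assert(alpha >= 0.0 && alpha <= 1.0);
  size_t n = adj.size();
  std::vector<double> I(n, 0.0), New_I;
  if (n == 0) return I;
  I[0] = 1.0;                                   // same starting point as Find_I()

  for (int iter = 0; iter < max_iters; iter++) {
    // Nodes with no outgoing edges give their value evenly to every node.
    double p = 0.0;
    for (size_t i = 0; i < n; i++) {
      if (adj[i].empty()) p += I[i];
    }
    New_I.assign(n, p / (double) n);

    // H * I: each node splits its value evenly among the nodes it points to.
    for (size_t from = 0; from < n; from++) {
      for (int to : adj[from]) {
        New_I[to] += I[from] / (double) adj[from].size();
      }
    }

    // Scale by alpha and add the (1 - alpha) / n share.
    double q = (1.0 - alpha) / (double) n;
    for (size_t i = 0; i < n; i++) New_I[i] = alpha * New_I[i] + q;

    // Quit once no entry moved by more than the threshold.
    bool done = true;
    for (size_t i = 0; done && i < n; i++) {
      if (std::fabs(I[i] - New_I[i]) > threshold) done = false;
    }
    I = New_I;
    if (done) break;
  }
  return I;
}
```

A commonly quoted choice for alpha is 0.85, but that number is only an example here. With alpha equal to 1 the last step does nothing, so on G2.txt this sketch should land on the same 1/3 and 2/3 as before.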
In bridge, you play with a partner. Partnerships are listed at the beginning of those results files. For example, if you take a look at the results for January 25, 2016 (the link may be broken by now), you'll see that I played with my wife, and Dr. Vander Zanden played with David Shepler (and yes, Dr. Vander Zanden scored better than we did - a rare occurrence...).
Some people, like me, play with very few partners. Some, like Mr. Shepler, play with a lot of partners. We can consider partnerships as edges in a graph -- people are nodes, and if two people are ever partners, we connect them with an edge. PageRank will then give you a measure of the people who have a lot of partnership coverage. The ACBL (the American Contract Bridge League) could well use PageRank to identify players for marketing purposes. Those with higher PageRank values would be more influential in, for example, testing new products than those with lower PageRank values. So, we are going to generate the graphs for this calculation and then do the calculation.
The data I used was composed of all of the bridge games played in Knoxville and Maryville in December, 2015 through January, 2016. I have put the data into the file players.txt, and to protect people's privacy, I have given everyone random names except for me, my wife, Dr. Vander Zanden, and Mr. Shepler. Every partnership has two lines in the file, one with one player first, and another with the second player first. So:
UNIX> grep Plank players.txt
James Plank -> Susan Plank
Susan Plank -> James Plank
UNIX> grep Vander players.txt
Brad Vander Zanden -> David Shepler
David Shepler -> Brad Vander Zanden
UNIX> grep Shepler players.txt
Kayla Tad -> David Shepler
Nathan Ebony -> David Shepler
Caleb Trudge -> David Shepler
Brad Vander Zanden -> David Shepler
Lauren Olga -> David Shepler
Matthew Prep -> David Shepler
David Shepler -> Kayla Tad
David Shepler -> Nathan Ebony
David Shepler -> Caleb Trudge
David Shepler -> Brad Vander Zanden
David Shepler -> Lauren Olga
David Shepler -> Matthew Prep
David Shepler -> Ava Ewe
David Shepler -> Sophie Shuck
Ava Ewe -> David Shepler
Sophie Shuck -> David Shepler
UNIX>
As you can see, in December and January, I only partnered with my wife, and she only partnered with me. Dr. Vander Zanden only partnered with Mr. Shepler, while Mr. Shepler had eight different partners in the two months.
My prediction on PageRank was that my wife and I would have the lowest possible PageRank scores. Dr. Vander Zanden's would be higher, and Mr. Shepler's would be a lot higher. I was wrong:
UNIX> bin/Pagerank-6-Do-Alpha < players.txt | sort -nr | head -n 1
0.00745788 Leah Tissue
UNIX> bin/Pagerank-6-Do-Alpha < players.txt | sort -nr | tail -n 1
0.00069401 Brody Ambition
UNIX> bin/Pagerank-6-Do-Alpha < players.txt | egrep 'Plank|Vander|Shepler'
0.00398250 David Shepler
0.00073052 Brad Vander Zanden
0.00204918 James Plank
0.00204918 Susan Plank
UNIX>
Mr. Shepler's PageRank was indeed a lot higher than ours, but Dr. VZ's was lower. Why? Here's a graph that may help illustrate.
The file E-Bridge-Graph.txt defines this graph, and let's take a look at PageRank on it:
UNIX> bin/Pagerank-6-Do-Alpha < E-Bridge-Graph.txt | sort -nr
0.13368724 Shepler
0.09090909 Suzy
0.09090909 Dr. P
0.09090909 D
0.09090909 C
0.09090909 B
0.09090909 A
0.08989999 Xavier
0.08972191 Zora
0.08972191 Wanda
0.05151441 Dr. VZ
UNIX>
Very interesting -- all nodes in simple cycles have the same PageRank, which is equal to 1/11 (one over the number of nodes). However, when you add the extra node only connected to one node in the cycle (the "Dr. VZ" node), everything gets shaken up. The "Shepler" node's PageRank goes up, which makes sense, because there are three ways into it. The "Dr. VZ" node's PageRank is the lowest because there's only one way into it, and Wanda/Zora/Xavier's nodes are in the middle. I'm guessing that Xavier's PageRank is higher than Wanda/Zora's because Wanda/Zora only get 1/3 of Shepler's visits, while Xavier gets 1/2 of Wanda/Zora's.
That makes it easier to understand why Dr. VZ's PageRank is lower than mine.
You'll also note that the sum of Wanda/Xavier/Zora/Shepler/Dr.-VZ's PageRanks is 5/11. In fact, if there are n nodes, then the PageRanks of the nodes in a connected component with c nodes will sum to c/n.
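Here's a short sketch of why that sum comes out to c/n. It leans on assumptions that hold for partnership graphs like the ones above but not for graphs in general: every node has at least one outgoing edge, no edge enters or leaves the component C, and alpha is less than 1. Writing S for the sum of the I values over C, the update New_I[i] = alpha * (H I)[i] + (1 - alpha) / n gives, at the fixed point:

```latex
\begin{aligned}
S &= \sum_{i \in C} I_i
   = \sum_{i \in C} \Bigl( \alpha \sum_{j} H_{ij} I_j + \frac{1-\alpha}{n} \Bigr) \\
  &= \alpha \sum_{j \in C} I_j \sum_{i \in C} H_{ij} + \frac{(1-\alpha)\,c}{n}
   = \alpha S + \frac{(1-\alpha)\,c}{n} \\
\Longrightarrow\quad S &= \frac{c}{n} \qquad (\alpha < 1)
\end{aligned}
```

The middle step uses the two assumptions: H_{ij} is zero when i is in C and j is not, and each column j in C sums to 1 over the rows in C.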
Interesting stuff, this.