CS302 Lecture Notes - Using upper_bound and lower_bound with a set or map

A feature of sets and maps that often gets overlooked is the ability to search for an item that may or may not be present, and return the item that is "near" it. There are two member functions of both sets and maps that do this:
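The two calls are lower_bound() and upper_bound(). Calling lower_bound(x) returns an iterator to the first element that is greater than or equal to x, and upper_bound(x) returns an iterator to the first element that is strictly greater than x. Both return end() if there is no such element. Here is a minimal sketch of the difference. The two wrapper functions and the "missing" parameter are made up for this illustration and are not part of the lecture's code:

```cpp
#include <set>

// Hypothetical helpers for illustration only.
// Returns the smallest element >= x, or `missing` if there is none.
int first_at_least(const std::set<int> &s, int x, int missing)
{
  std::set<int>::const_iterator it = s.lower_bound(x);
  return (it == s.end()) ? missing : *it;
}

// Returns the smallest element > x, or `missing` if there is none.
int first_greater(const std::set<int> &s, int x, int missing)
{
  std::set<int>::const_iterator it = s.upper_bound(x);
  return (it == s.end()) ? missing : *it;
}
```

With a set holding 10, 20 and 30, the two calls differ only when x is in the set: for x = 20, lower_bound finds 20 and upper_bound finds 30, while for x = 15 both find 20.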

Are the names counter-intuitive? Yes. Whatever. Let's show an example of a program that uses one. (BTW, if you want more practice, do the 500-point problem from division one of Topcoder SRM 337).

If you took CS140 from me, you may have done the Keno lab. I'll refresh your memory with the preamble to the lab:

We're not redoing the Keno lab, fun as it is. Instead, we have another Keno problem. Casino mogul Ronald Thump (these lecture notes were written well before Mr. Thump ever ran for president...) has contacted you to design a tempting side-bet for his Keno rooms. Of course, he wants it to make him money, but to seem like it makes the gambler money. Here's what you come up with:

The Lucky Loser Bet

  • You choose one number.
  • If that number is one of the 20 picked, then you simply get your money back.
  • However, if it is not one of the 20 picked, then you look at the closest number picked that is greater than yours, and the closest number picked that is less than yours. (Wrap around if there's no number greater or less than yours).
  • If the higher number is closer to yours than the lower number, you win $1.25 on a $1.00 bet.
  • Otherwise, you lose.

Mr. Thump thinks that this is a catchy game. People may well take out "insurance" on their Keno bets by choosing all of their Keno picks as Lucky Losers. At a very high level, it looks like a good bet for the following reason:

If your number is picked, nothing matters. However, if it is not picked, then it would seem as though you have a roughly 50-50 chance of having the closest number picked be higher than yours. You're getting $1.25 instead of $1.00 on what seems to be a 50-50 chance. I'd take those odds!

Now, of course, the odds aren't 50-50. Why? Suppose the numbers 4 and 8 are picked, but 5, 6 and 7 are not. 7 is a winning bet, but 6 and 5 are not (6 is equally far from 4 and 8, and the higher number has to be strictly closer for you to win). In other words, when the number of unpicked balls between two picked balls is an odd number i, then there are (i-1)/2 winners and (i+1)/2 losers. Of course, if that number is even, then there are an equal number of winners and losers.
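If you'd like to check that counting argument by brute force, here is a small sketch. The function name is hypothetical. It assumes lo and hi are two adjacent picked balls with lo < hi, and it counts the balls strictly between them that are strictly closer to hi than to lo:

```cpp
// Hypothetical helper for illustration: count the winning balls strictly
// between two adjacent picked balls lo < hi.  A ball b wins only when it
// is strictly closer to hi than to lo.
int winners_between(int lo, int hi)
{
  int b, wins;

  wins = 0;
  for (b = lo + 1; b < hi; b++) {
    if (hi - b < b - lo) wins++;
  }
  return wins;
}
```

With picked balls 4 and 8, the call winners_between(4, 8) counts one winner (ball 7) out of three balls. With 4 and 9, it counts two winners (7 and 8) out of four.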

It would take some good math to figure out the closed-form probability of the Lucky Loser bet. However, we can simulate, and that's just as good, at least to Mr. Thump. We'll go through a sequence of programs that illustrate a number of points.


Program #1 - using lower_bound

I'm not going through all of this program. The basic version is in keno-ll.cpp. Let's first look at our class definition for a Keno_LL:

class Keno_LL {
  public:
    int NB;                       // Initial parameter: Number of balls in play (80 in our example)
    int NP;                       // Initial parameter: Number of balls picked each time (20 in our example)
    double Payout;                // This is how much you win in the Lucky Loser bet (1.25 in our example)
    int Iterations;               // Number of iterations for the simulation. 0 if interactive
    int Verbose;                  // Output on each iteration, or just at the end?

    set <int> Picked;             // This is used for the balls picked at each iteration

    int Wins, Losses, Ties;       // Stats - total wins, losses and ties
    double Winnings;              // Total winnings (yes, we could calculate from above)
    double N;                     // Iteration so far

    void Pick_Balls();            // Creates Picked randomly
    void Calculate_Payout(int b); // Given a ball b, and set Picked, calculates the payout and updates the stats.
};

All these are straightforward. At each iteration, we'll call Pick_Balls(), which will put NP random numbers from 1 to NB into Picked. Then we'll get a value of b, either from standard input or randomly, and call Calculate_Payout(), which will determine whether b is a winner, loser or tie, and update all those stats accordingly. I won't show the code for main() -- it simply parses command line arguments, sets up an instance of the Keno_LL class, and makes the appropriate calls to Pick_Balls() and Calculate_Payout().

Let's take a look at those methods:

/* Procedure to pick balls randomly.  The balls are put into the
   set "Picked," which is sentinelized so that the first ball is  
   at the end of the set, after the maximum numbered ball, and the  
   last ball is at the beginning of the set, before ball 1. */

void Keno_LL::Pick_Balls()
{
  int i, j, first, last;
  set <int>::iterator pbit;

  Picked.clear();
  
  for (i = 0; i < NP; i++) {
    do j = random()%NB+1; while (Picked.find(j) != Picked.end());
    Picked.insert(j);
  }

  if (Verbose) {
    cout << "Balls Picked:";
    for (pbit = Picked.begin(); pbit != Picked.end(); pbit++) {
      cout << " " << *pbit;
    }
    cout << ".\n";
  }

  first = *(Picked.begin());       /* Sentinelize Picked */
  last = *(Picked.rbegin());
  Picked.insert(NB+first);
  Picked.insert(last-NB);
}

Pick_Balls() puts random numbers into a set. It takes care not to put duplicates into the set. If specified, it prints out the balls picked. At the end, it sentinelizes the set. If the smallest element is s and the largest l, then it inserts (l-NB) and (NB+s) into the set. This is to handle the case when the ball that the contestant picks is lower than all the picked balls, or higher than all the picked balls. Then, you don't have to have any special case code in Calculate_Payout().

Here's Calculate_Payout():

void Keno_LL::Calculate_Payout(int b)
{
  set <int>::iterator pbit;
  int u, l;
  double win;
  
  /* Determine whether b is picked (a tie), a winner or a loser. */

  pbit = Picked.lower_bound(b);
  if (*pbit == b) {
    win = 0;
    Ties++;
  } else {
    u = *pbit;
    pbit--;
    l = *pbit;
    if (u - b < b - l) {
      win = Payout;
      Wins++;
    } else {
      win = -1;
      Losses++;
    }
  }
 
  /* Update stats, and print out what happened, if desired. */

  Winnings += win;
  N++;

  if (Verbose) {
    if (win == 0) {
      printf("  Your ball was picked. +0: ");
    } else {
      printf("  D to higher: %d.  D to lower: %d.  %+.2lf: ",
        u-b, b-l, win);
    }
    printf("Total = %.2lf.  Avg = %.6lf\n", Winnings, Winnings/N);
  }
}

I use the lower_bound method to find the smallest element in Picked that is greater than or equal to b. Since we sentinelized Picked, we're guaranteed that there will be an element greater than b and an element less than b. That's nice, because we don't have to test whether pbit is equal to Picked.end() or Picked.begin().

If we didn't find the element, then we find the balls greater than and less than b with "u = *pbit; pbit--; l = *pbit;". This is because you're allowed to increment and decrement iterators to move around the set.

The rest of the code is straightforward.


A little more detail on the sentinels in Picked

I received a lot of questions in class about the sentinels. Let me give a concrete example. Suppose the balls picked are as follows:

2 3 5 12 13 26 31 35 36 38 44 45 51 54 60 65 67 68 70 76

And suppose that we don't use a sentinel in Picked. Let me give three different examples of numbers that the user chooses. Suppose that the user chooses 37. We'll look that up in Picked using Picked.lower_bound(). It will return an iterator to 38, because 38 is the smallest value greater than or equal to 37. We can decrement the iterator to find the greatest value less than 37 -- that is 36. Thus, we can determine that the user's pick was a loser quite easily.

In example 2, suppose that the user chooses 1. When we call Picked.lower_bound(), it will return an iterator to 2. Since 2 is the smallest value in the set, we'll need to write special-purpose code to calculate that 76 is the "greatest value less than" 1, and that its distance from 1 is 5. That's a drag.

In example 3, suppose that the user chooses 77. Now, Picked.lower_bound() is going to return Picked.end(), and we have to write more special-purpose code to determine that 76 is the lower value and 2 is the "higher" value.
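To make that "special-purpose code" concrete, here is roughly what it might look like. This is a sketch and not code from keno-ll.cpp. The function name and its parameters are made up, and it assumes that picked is non-empty, that b is not in picked, and that all balls are between 1 and nb:

```cpp
#include <set>

// Hypothetical sketch (not from keno-ll.cpp): find the distances from b to
// the nearest picked balls above and below, wrapping around by hand because
// the set has no sentinels.  Assumes picked is non-empty, b is not in
// picked, and all balls are between 1 and nb.
void distances_no_sentinels(const std::set<int> &picked, int nb, int b,
                            int &to_higher, int &to_lower)
{
  std::set<int>::const_iterator it = picked.lower_bound(b);

  if (it == picked.end()) {              // b is above every picked ball
    to_higher = *picked.begin() + nb - b;
    to_lower = b - *picked.rbegin();
  } else if (it == picked.begin()) {     // b is below every picked ball
    to_higher = *it - b;
    to_lower = b + nb - *picked.rbegin();
  } else {                               // The easy case
    to_higher = *it - b;
    --it;
    to_lower = b - *it;
  }
}
```

For the balls above with nb = 80, choosing 1 gives a distance of 1 to the higher ball (2) and 5 to the "lower" ball (76). Choosing 77 gives 5 to the "higher" ball (2) and 1 to the lower ball (76).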

We use the sentinels to avoid writing all of that special-purpose code. We insert 82 = 2+80 and -4 = 76-80 into the set, which now becomes:

-4 2 3 5 12 13 26 31 35 36 38 44 45 51 54 60 65 67 68 70 76 82

We are now guaranteed of two things when we call Picked.lower_bound():

  1. It will never return Picked.end(). In other words it will always return an iterator to an element that equals the user's choice, or that is the smallest element greater than the user's choice.
  2. It will never return the first element of Picked. Which means that when you decrement it to find the greatest element smaller than the user's choice, it will point to a valid value.

And we don't have to write any special-purpose code. Judicious use of sentinels can make your code very clean. Remember our implementation of doubly-linked lists in CS140? Without that header node, the implementation would be really messy. The sentinel (the header node) made for clean and efficient code.

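Here is a small standalone sketch of the same idea, separate from the Keno_LL class. The function names are hypothetical. It assumes that picked is non-empty and holds balls between 1 and nb, and that b is between 1 and nb and is not itself picked:

```cpp
#include <set>

// Hypothetical sketch: add the two sentinels described above to a non-empty
// set of picked balls numbered 1..nb.
void sentinelize(std::set<int> &picked, int nb)
{
  int first = *picked.begin();
  int last = *picked.rbegin();

  picked.insert(nb + first);
  picked.insert(last - nb);
}

// With sentinels in place, one code path handles every b from 1 to nb
// that is not itself picked: no tests against begin() or end().
void distances_with_sentinels(const std::set<int> &picked, int b,
                              int &to_higher, int &to_lower)
{
  std::set<int>::const_iterator it = picked.lower_bound(b);

  to_higher = *it - b;
  --it;
  to_lower = b - *it;
}
```

With the 20 balls from the example and nb = 80, choosing 1 finds 2 above and the sentinel -4 below (distances 1 and 5), and choosing 77 finds the sentinel 82 above and 76 below (distances 5 and 1).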
Let's run it. The parameters of the program are NB, NP, Payout, Iterations and Verbose:
UNIX> g++ -o keno-ll keno-ll.cpp
UNIX> keno-ll
usage: keno-ll #balls #picked payout iterations-(zero-to-play) verbose(y|n)
UNIX> keno-ll 80 20 1.25 0 y
Pick your ball: 8
Balls Picked: 2 3 5 12 13 26 31 35 36 38 44 45 51 54 60 65 67 68 70 76.
  D to higher: 4.  D to lower: 3.  -1.00: Total = -1.00.  Avg = -1.000000
Pick your ball: 8
Balls Picked: 1 2 4 7 8 20 31 35 36 37 39 48 51 58 65 67 70 73 75 80.
  Your ball was picked. +0: Total = -1.00.  Avg = -0.500000
Pick your ball: 8
Balls Picked: 3 12 15 26 32 33 35 36 39 40 49 50 54 59 65 66 69 70 72 73.
  D to higher: 4.  D to lower: 5.  +1.25: Total = 0.25.  Avg = 0.083333
Pick your ball: 8
Balls Picked: 1 7 8 12 16 22 28 30 35 41 45 46 51 52 62 64 69 71 73 75.
  Your ball was picked. +0: Total = 0.25.  Avg = 0.062500
Pick your ball: 8
Balls Picked: 5 11 15 17 20 21 22 27 29 32 35 43 47 50 56 57 61 66 67 68.
  D to higher: 3.  D to lower: 3.  -1.00: Total = -0.75.  Avg = -0.150000
UNIX> keno-ll 80 20 1.25 0 n
8
8
8
8
8
Total = 4.00.  Avg = 0.800000.  W/L/T: 4 1 0
UNIX> keno-ll 80 20 1.25 5 y
Picked 59
Balls Picked: 5 8 9 14 15 19 23 24 33 45 54 61 66 69 70 71 75 78 79 80.
  D to higher: 2.  D to lower: 5.  +1.25: Total = 1.25.  Avg = 1.250000
Picked 58
Balls Picked: 1 3 7 10 13 17 31 35 40 42 46 49 57 59 65 68 70 71 75 80.
  D to higher: 1.  D to lower: 1.  -1.00: Total = 0.25.  Avg = 0.125000
Picked 34
Balls Picked: 6 7 8 15 22 23 26 35 39 41 49 55 56 59 64 69 72 75 76 77.
  D to higher: 1.  D to lower: 8.  +1.25: Total = 1.50.  Avg = 0.500000
Picked 7
Balls Picked: 1 7 11 13 24 26 29 33 34 37 46 50 52 54 58 64 66 67 75 79.
  Your ball was picked. +0: Total = 1.50.  Avg = 0.375000
Picked 68
Balls Picked: 4 6 8 10 12 20 23 26 28 31 33 38 43 49 58 67 71 74 76 77.
  D to higher: 3.  D to lower: 1.  -1.00: Total = 0.50.  Avg = 0.100000
UNIX> keno-ll 80 20 1.25 5 n
Total = -2.00.  Avg = -0.400000.  W/L/T: 0 2 3
UNIX> 
If we choose a large number of iterations, we can start to see whether this is a good or bad bet over the long run. I use time to show how long each run takes (on my Macintosh):
UNIX> time keno-ll 80 20 1.25 10  n
Total = 3.00.  Avg = 0.300000.  W/L/T: 4 2 4
0.000u 0.000s 0:00.00 0.0%	0+0k 0+1io 0pf+0w
UNIX> time keno-ll 80 20 1.25 100 n
Total = -21.00.  Avg = -0.210000.  W/L/T: 24 51 25
0.003u 0.001s 0:00.00 0.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll 80 20 1.25 1000 n
Total = -31.25.  Avg = -0.031250.  W/L/T: 315 425 260
0.022u 0.001s 0:00.02 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll 80 20 1.25 10000 n
Total = -566.75.  Avg = -0.056675.  W/L/T: 3069 4403 2528
0.170u 0.001s 0:00.17 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll 80 20 1.25 100000 n
Total = -3068.00.  Avg = -0.030680.  W/L/T: 32096 43188 24716
1.637u 0.001s 0:01.63 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll 80 20 1.25 1000000 n
Total = -26723.75.  Avg = -0.026724.  W/L/T: 321273 428315 250412
16.364u 0.009s 0:16.37 99.9%	0+0k 0+0io 0pf+0w
UNIX> 
Well, it appears to be converging slightly, but man, that's slow. First, let's use the optimizer -- that usually speeds things up. There are four levels of optimization, and usually the -O3 flag gives you the best bang for the buck:
UNIX> g++ -o keno-ll -O keno-ll.cpp
UNIX> time keno-ll 80 20 1.25 1000000 n
Total = -28030.00.  Avg = -0.028030.  W/L/T: 321128 429440 249432
6.415u 0.001s 0:06.41 100.0%	0+0k 0+0io 0pf+0w
UNIX> g++ -o keno-ll -O2 keno-ll.cpp
UNIX> time keno-ll 80 20 1.25 1000000 n
Total = -27554.00.  Avg = -0.027554.  W/L/T: 321096 428924 249980
6.185u 0.003s 0:06.18 100.0%	0+0k 0+0io 0pf+0w
UNIX> g++ -o keno-ll -O3 keno-ll.cpp
UNIX> time keno-ll 80 20 1.25 1000000 n
Total = -27053.75.  Avg = -0.027054.  W/L/T: 321521 428955 249524
6.150u 0.003s 0:06.15 100.0%	0+0k 0+0io 0pf+0w
UNIX> g++ -o keno-ll -O4 keno-ll.cpp
UNIX> time keno-ll 80 20 1.25 1000000 n
Total = -27600.75.  Avg = -0.027601.  W/L/T: 320969 428812 250219
6.150u 0.002s 0:06.15 100.0%	0+0k 0+0io 0pf+0w
UNIX> 
There are definitely a few places to work on speeding up. First is Pick_Balls(). That's a poor way to choose random numbers, since you may have to throw numbers away when you choose duplicates. Think about it -- what if NP is really close to NB? Then when you get to the last balls, you're more likely to have to throw away a pick than not.
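To put a number on that: when i distinct balls are already in the set, a random draw is a new ball with probability (NB-i)/NB, so on average it takes NB/(NB-i) draws to get the next one. The sketch below (the function name is hypothetical) adds that up over all NP picks:

```cpp
// Hypothetical helper: expected number of random draws that the
// retry-on-duplicate loop needs to collect np distinct balls out of nb.
// Assumes 0 <= np <= nb.
double expected_draws(int nb, int np)
{
  int i;
  double total;

  total = 0;
  for (i = 0; i < np; i++) total += (double) nb / (nb - i);
  return total;
}
```

For NB = 80 and NP = 20, this works out to a little under 23 draws for 20 balls, so the waste is modest. For NP = NB = 80 it is a little over 397 draws for 80 balls, and the very last ball alone takes 80 draws on average.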

A better way is to put all of the numbers from 1 to 80 into an array, and then randomly pull them out. Each time you "pull a number out", you move it to the end of the array, and then don't consider it for the next pick.

The new code is in keno-ll2.cpp. I've added a vector Balls to the Keno_LL class, and I have initialized it to hold the numbers 1 through NB. Then, Pick_Balls() works as follows:

void Keno_LL::Pick_Balls()
{
  int i, j, first, last, tmp;
  set <int>::iterator pbit;

  Picked.clear();
  
  for (i = 0; i < NP; i++) {
    j = random()%(NB-i);
    tmp = Balls[j];
    Balls[j] = Balls[NB-i-1];
    Balls[NB-i-1] = tmp;
    Picked.insert(Balls[NB-i-1]);
  }

  ...

It runs faster, but not by a huge amount (9 percent):

UNIX> g++ -o keno-ll2 -O3 keno-ll2.cpp
UNIX> time keno-ll2 80 20 1.25 1000000 n
Total = -27893.75.  Avg = -0.027894.  W/L/T: 321237 429440 249323
5.652u 0.005s 0:05.65 100.0%	0+0k 0+0io 0pf+0w
UNIX> 
Can we do better? Well, consider this -- instead of trying one random ball on each iteration, let's just look at all of them. We'll do this in a method called Calculate_All(), which we only call when we're doing multiple iterations. Here is a very simple implementation, which just calls Calculate_Payout() for all balls (in keno-ll3.cpp):

void Keno_LL::Calculate_All()
{
  int b;

  for (b = 1; b <= NB; b++) Calculate_Payout(b);
}

We'll compile with optimization and run it:

UNIX> g++ -O3 -o keno-ll3 keno-ll3.cpp
UNIX> time keno-ll3 80 20 1.25 1000000 n
Total = -27149.25.  Avg = -0.027149.  W/L/T: 321267 428733 250000
0.108u 0.001s 0:00.10 100.0%	0+0k 0+0io 0pf+0w
UNIX> 
Dang, that was fast! It's because we're iterating 1000000/80 times instead of 1000000. Granted, we're doing a little more work at each iteration, but not much.

We can make it faster still, though. We don't really need to calculate the picks for every ball. Instead, we can just run through the set and check each interval. If the interval contains x unpicked balls, then there will be x/2 winners (integer division) and x - x/2 losers.
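A quick sanity check of that claim, independent of the lecture's files, is to tally the balls both ways on the same sentinelized set and compare. This is a sketch: the Tally struct and both function names are hypothetical, and both functions assume that picked already contains the two sentinels.

```cpp
#include <set>

struct Tally { int wins, losses, ties; };   // Hypothetical helper type

// Ball by ball: classify each ball 1..nb with lower_bound, in the same
// way as Calculate_Payout().  picked must already contain both sentinels.
Tally tally_per_ball(const std::set<int> &picked, int nb)
{
  Tally t = { 0, 0, 0 };
  std::set<int>::const_iterator it;
  int b, u, l;

  for (b = 1; b <= nb; b++) {
    it = picked.lower_bound(b);
    if (*it == b) { t.ties++; continue; }
    u = *it;
    --it;
    l = *it;
    if (u - b < b - l) t.wins++; else t.losses++;
  }
  return t;
}

// Interval by interval: a gap of x unpicked balls has x/2 winners and
// x - x/2 losers.  The last interval (up to the high sentinel) is skipped
// because the first interval (from the low sentinel) already covers the
// wrap-around gap.
Tally tally_per_interval(const std::set<int> &picked)
{
  Tally t = { 0, 0, 0 };
  std::set<int>::const_iterator low, high, last;
  int x;

  last = picked.end();
  --last;                       // Points at the high sentinel
  low = picked.begin();
  high = low;
  ++high;
  while (high != last) {
    x = *high - *low - 1;
    t.wins += x / 2;
    t.losses += x - x / 2;
    t.ties++;                   // *high is a real picked ball
    ++low;
    ++high;
  }
  return t;
}
```

On the 20 balls from the sentinel example (with -4 and 82 added), both functions report 25 winners, 35 losers and 20 ties.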

The code is in keno-ll4.cpp:

void Keno_LL::Calculate_All()
{
  set <int>::iterator low, high;
  int x, highest;
  int nw, nl;

  highest = *Picked.rbegin();
  low = Picked.begin();
  high = low;
  high++;
  nw = 0;
  nl = 0;

  while (*high != highest) {
    x = *high - *low - 1;
    nw += (x/2);
    nl += (x - x/2);
    low++;
    high++;
  }
  Wins += nw;
  Losses += nl;
  Ties += NP;
  Winnings += nw*Payout;
  Winnings -= nl;
  N += NB;
}

Note once again how the sentinels help us keep our code clean. It speeds matters up even further:

UNIX> g++ -O4 -o keno-ll4 keno-ll4.cpp
UNIX> time keno-ll4 80 20 1.25 1000000 n
Total = -27682.50.  Avg = -0.027682.  W/L/T: 321030 428970 250000
0.076u 0.000s 0:00.07 100.0%	0+0k 0+0io 0pf+0w
UNIX> 
0.076 vs 0.108 may not seem like much, but it is over 25 percent.

I ended here in class, but I mentioned that you can do this even faster. How? Get rid of the set. Instead, simply put the picked balls into an array and sort the array with the STL procedure sort(). Then you traverse the array in Calculate_All() instead of the set.

The code is in keno-ll5.cpp, where we write a separate Pick_Balls_Array() which puts the balls into an array PBalls:

void Keno_LL::Pick_Balls_Array()
{
  int i, j, first, last, tmp;

  for (i = 0; i < NP; i++) {
    j = random()%(NB-i);
    tmp = Balls[j];
    Balls[j] = Balls[NB-i-1];
    Balls[NB-i-1] = tmp;
    PBalls[i] = Balls[NB-i-1];
  }
  PBalls[i] = NB+1;   /* Make room for the sentinel at the end */
  sort(PBalls.begin(), PBalls.end());
  PBalls[NP] = NB+PBalls[0];  /* Put a sentinel at the end */

  if (Verbose) {
    cout << "Balls Picked:";
    for (i = 0; i < NP; i++) cout << " " << PBalls[i];
    cout << ".\n";
  }

}

void Keno_LL::Calculate_All()
{
  int i, x;
  int nw, nl;

  nw = 0;
  nl = 0;

  for (i = 0; i < NP; i++) {
    x = PBalls[i+1] - PBalls[i] - 1;
    nw += (x/2);
    nl += (x - x/2);
  }
  Wins += nw;
  Losses += nl;
  Ties += NP;
  Winnings += nw*Payout;
  Winnings -= nl;
  N += NB;
}

Note how the array is again sentinelized. The array PBalls has NP+1 elements. After putting the random balls in unsorted order into PBalls, we set PBalls[NP] to equal NB+1, and then we sort it. Since NB+1 is larger than any ball, we know that PBalls[NP] will still equal NB+1 after the sort. We also know that the minimum element is in PBalls[0], so we can put PBalls[0]+NB into the last element, PBalls[NP]. Now we have all the intervals represented easily in PBalls.
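Here is the same layout as a standalone sketch, outside of the Keno_LL class. The function names are hypothetical, and it assumes that picks is non-empty and holds distinct balls between 1 and nb:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the array layout: given np distinct picked balls
// (in any order) from 1..nb, return a sorted vector of np+1 entries whose
// last entry is the wrap-around sentinel nb + (smallest ball).
std::vector<int> sentinelized_array(const std::vector<int> &picks, int nb)
{
  std::vector<int> pb(picks);

  pb.push_back(nb + 1);               // Placeholder; it sorts to the end
  std::sort(pb.begin(), pb.end());
  pb[pb.size() - 1] = nb + pb[0];     // The real sentinel
  return pb;
}

// Count the winners over all np intervals pb[i]..pb[i+1].
int count_winners(const std::vector<int> &pb)
{
  int wins = 0;

  for (std::size_t i = 0; i + 1 < pb.size(); i++) {
    wins += (pb[i + 1] - pb[i] - 1) / 2;
  }
  return wins;
}
```

Only one sentinel is needed here, because the interval from the largest ball to the sentinel is the wrap-around interval. With the 20 balls 2 3 5 ... 70 76 and nb = 80, the last entry becomes 82 and count_winners() returns 25.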

When we run it, it's much, much faster -- this is the best code for the problem:

UNIX> g++ -O3 -o keno-ll5 keno-ll5.cpp
UNIX> time keno-ll5 80 20 1.25 1000000 n
Total = -28366.50.  Avg = -0.028366.  W/L/T: 320726 429274 250000
0.015u 0.001s 0:00.01 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll5 80 20 1.25 10000000 n
Total = -278195.25.  Avg = -0.027820.  W/L/T: 3209691 4290309 2500000
0.114u 0.001s 0:00.11 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll5 80 20 1.25 100000000 n
Total = -2778566.25.  Avg = -0.027786.  W/L/T: 32098415 42901585 25000000
1.103u 0.001s 0:01.10 100.0%	0+0k 0+0io 0pf+0w
UNIX> time keno-ll5 80 20 1.25 1000000000 n
Total = -27759207.00.  Avg = -0.027759.  W/L/T: 320995908 429004092 250000000
10.999u 0.004s 0:11.00 99.9%	0+0k 0+0io 0pf+0w
UNIX> 
We can confidently tell Mr. Thump that this bet will make him 2.78 cents on every dollar bet. My guess is that he'd like a little more. If you make the payout $1.20 instead of $1.25, his profit goes to 4.38 cents. How does that compare? Well, Roulette is a profit of 5 cents. Three card poker goes anywhere from 1.96 cents to about 10, depending on the odds. I'd say you've invented a pretty good game!
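Those two figures come straight from the tallies in the billion-iteration run above. A tie pays nothing, a win pays the payout and a loss costs $1, so the average per dollar bet is (wins * payout - losses) / N. Here it is as a tiny sketch (the function name is hypothetical). The $1.20 figure assumes that the win and loss fractions stay the same, which they should, since the payout does not affect which balls win:

```cpp
// Hypothetical helper: average winnings per $1 bet, given the number of
// wins and losses out of n trials and the payout for a win.  Ties pay 0.
double average_winnings(double wins, double losses, double n, double payout)
{
  return (wins * payout - losses) / n;
}
```

Plugging in 320995908 wins and 429004092 losses out of 1000000000 trials gives about -0.02776 for a payout of 1.25 and about -0.04381 for a payout of 1.20.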

What have we learned?

Well, my goals here were twofold. First, I wanted to teach you upper_bound and lower_bound. Second, I wanted to show you how the choice of algorithm and data structure impacts performance. While the code in keno-ll.cpp was reasonable to solve the problem, it had three performance deficiencies:

  1. The random ball-picking algorithm was inefficient.
  2. The code for multiple iterations had too much randomness -- calculating the probabilities for all 80 balls at each iteration is a much smarter and faster way to perform the evaluation.
  3. Performing insertions/finds with a set is slower than using a vector and sorting. The code for Pick_Balls() and Calculate_All() is O(NP log(NP)) in both cases. However, the vector version is faster, largely because it is more memory-efficient.

As a final thought exercise, would you have to change your implementation if NB is huge (say, 500,000,000) and NP is much smaller (say, 1000)? Think about it, and if you want your thought process corroborated, ask me in class.