CS494 Lecture Notes - Bayesian Optimization


THESE LECTURE NOTES ARE NOT COMPLETE OR FINISHED

In particular, I have been disappointed by "BayesOpt", and have switched to Scikit Optimize. I'll update the lecture notes.

I am not an expert in this

Just a caveat: I am not an expert in this area. However, as with a lot of optimization machinery (e.g. Deep Learning), you don't need to be an expert in order to leverage its results, and you should keep it in the back of your mind as a solution to problems that may arise in your life.

Reference Material

Wikipedia does a far better job than I in explaining Bayesian Optimization: https://en.wikipedia.org/wiki/Bayesian_optimization. I use the "BayesOpt" toolkit, which has a very detailed page explaining theory: http://rmcantin.github.io/bayesopt/html/bopttheory.html. The author of BayesOpt has asked that if you use the toolkit, you cite his journal paper:

Ruben Martinez-Cantin, "BayesOpt: A Bayesian Optimization Library for Nonlinear Optimization, Experimental Design and Bandits." Journal of Machine Learning Research, 15 (Nov): pp. 3735-3739, 2014.

My hatred of Python precludes me from exploring its options, but there are many. Scikit-optimize is one that my colleagues have used. This description of Bayesian Optimization is not only a good one, but it also has some Jupyter notebooks that you can play along with, if that's your jam.

The problem being solved

You are trying to minimize a function. You can evaluate this function, and the evaluation may be expensive, but you don't know enough about it to use better optimization techniques (e.g. Calculus, Newton's Method, or Gradient Descent). What can you do? Well, if the parameters to the function are limited enough, you can enumerate them. Most parameter spaces are not that limited, though. So, the brain-dead approach that most people take is a Monte Carlo grid search: repeatedly choose random parameters and evaluate the function. Then, when you find some parameters that minimize the function, hone the search so that you are trying parameters close to the ones that minimized it. Repeat and pray.

Bayesian optimization tries to improve this approach by shoving some math behind it. I'll summarize it as follows (go ahead and read the Wikipedia page if you're interested in learning more): the optimizer fits the data that you have so far with a cheap approximating function (a "surrogate"). Since the surrogate is cheap to evaluate and analyze, the optimizer can tell you which parameters look most promising to try next. It tells you those parameters, you evaluate them with your real function, and it uses that answer to refine the surrogate.
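Spelled out as a loop, that's:

1. Evaluate your function at a handful of initial parameter settings.
2. Fit the surrogate to all of the (parameters, result) pairs you have so far.
3. Use the surrogate (plus a selection rule -- see below) to pick the most promising parameters to try next.
4. Evaluate your function there, add the result to your data, and go back to step 2.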

I won't steal pictures from other web sites -- just look at the ones above.

Two Main Decision Points

There are two main decision points when you're using Bayesian Optimization. The first is what function you use as the approximation. The most popular one is called a "Gaussian Process." To quote Wikipedia's page: "In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed."
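If you want the actual formulas, they are the standard Gaussian process regression equations (nothing specific to any one library). Suppose you have evaluated your function at points x_1, ..., x_k and collected the results in a vector y. Then at any new point x, the GP predicts a mean and a variance:

    \mu(x)      = k(x)^\top \bigl( K + \sigma_n^2 I \bigr)^{-1} y
    \sigma^2(x) = k(x, x) - k(x)^\top \bigl( K + \sigma_n^2 I \bigr)^{-1} k(x)

Here, K is the k-by-k matrix of kernel (covariance) values between the evaluated points, k(x) is the vector of kernel values between x and each evaluated point, and \sigma_n^2 is the assumed observation noise. The mean is the surrogate's best guess and the variance is its uncertainty, which is exactly what the selection algorithm below feeds on.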

The second is the selection algorithm: how do you choose the next set of parameters to test? The most popular one here is called "Expected Improvement". With the selection algorithm, you would like a blend of exploration and exploitation: the algorithm should refine the parameters that it already knows are good (exploitation), but it should also try parameters from unexplored regions of the parameter space, which may be bad, but may also be a lot better than anything we know (exploration). Typically, there are parameters that you can set to affect this tradeoff.
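To make "Expected Improvement" concrete: when the surrogate is a Gaussian process, EI has a well-known closed form. Here is a small, self-contained function that computes it, written for our maximization setting (this is my sketch of the standard formula, not code from any toolkit). mu and sigma are the surrogate's predicted mean and standard deviation at the candidate point, and best is the best function value seen so far:

#include <cmath>

/* Standard normal PDF and CDF. */
static double std_normal_pdf(double z) { return exp(-0.5 * z * z) / sqrt(2.0 * M_PI); }
static double std_normal_cdf(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }

/* Expected Improvement for maximization: E[max(f(x) - best, 0)], where the
   surrogate predicts f(x) ~ Normal(mu, sigma^2) at the candidate point. */

double expected_improvement(double mu, double sigma, double best)
{
  double z;

  if (sigma <= 0) return (mu > best) ? (mu - best) : 0;   /* No uncertainty: improvement is deterministic. */
  z = (mu - best) / sigma;
  return (mu - best) * std_normal_cdf(z) + sigma * std_normal_pdf(z);
}

Note how a large sigma inflates EI even when mu is mediocre -- that is the exploration half of the tradeoff, and the knobs that the libraries expose typically bias this calculation one way or the other.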


Demonstration without getting too mathematical

I'm going to use the BayesOpt library, which is older, but also (kind of) manageable and in C++. You can grab it from Github at https://github.com/rmcantin/bayesopt. In these lecture notes, I assume that it has been cloned into the directory $HOME/src/repos/bayesopt. If you want to play along, clone the repo, perform the compilation and fix the variable in the makefile. For these examples, I use the "C" version (even though I'm in C++ -- it's pretty well documented how much I hate inheritance, and I can go on the record also as hating Boost, so I don't use their "C++" version). That only requires that you use the -I flag to include their include directory, and that you link with lib/libbayesopt.a and lib/libnlopt.a. I like that simplicity.

For my demonstration, my goal is to present a function that is intuitive, clearly non-linear, but not mathematically complicated. Here it is. My function is defined by a set of points, P, in a bounded n-dimensional coordinate system. The coordinate system is bounded in that all points have coordinates that are between 0 and 10 (inclusive). The function takes a point (x_0, x_1, ..., x_{n-1}), and it returns the shortest distance from that point to any point in P. Our goal is to maximize this function over all points whose coordinates are between 1 and 9 (I'd do 0 and 10, but that tends to have maxima that are on the border).
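In symbols, the function being maximized is

    f(x_0, \ldots, x_{n-1}) \;=\; \min_{p \in P} \, \lVert x - p \rVert_2 \;=\; \min_{p \in P} \sqrt{\sum_{i=0}^{n-1} (x_i - p_i)^2}

which is just the Euclidean distance from the query point to its nearest neighbor in P.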

It's easy to concoct a scenario where this function is important: There are a bunch of pollution sources in a city, and you want to build a house that is the farthest from a pollution source. You can also solve this problem by better means, but since we're focusing on optimization, we are treating the function as a black box.

I have a quick and dirty implementation here, in include/points.hpp and src/points.cpp. I'll only show the header. This is a pretty simple data structure:

#pragma once 

#include <vector>
#include <istream>

/* This is a data structure to help demonstrate Bayesian Optimization.  See the 
   accompanying lecture notes. */

class Points {
  public:

    /* The only piece of data is a vector of points.  Each element of the vector is 
       a point: a vector of n coordinates, each a double between 0 and 10.  I'm 
       making it public, because I'm not in the mood to get all object oriented 
       about this. */

    std::vector < std::vector <double> > p;
  
    /* You'll note -- no constructors, destructors or assignment overloads.  Since the
       only piece of data is that vector of vectors, all the defaults will work. */

    /* Reads from a stream and double-checks that everything is correct.  If it
       encounters an error, it will throw an exception of type std::runtime_error. */

    void Read(std::istream &in);

    /* Here is a function that we want to maximize.  Given a point in this
       n-dimensional space, what is the smallest distance to a point in p? 
       If x's size is not equal to the number of dimensions of p, it will throw
       an exception of type std::runtime_error. */
       
    double Min_Distance(const std::vector <double> &x) const;
};
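The implementation is in src/points.cpp, which I won't walk through, but a minimal Min_Distance could look like the following sketch (mine -- the real file may differ):

#include <cmath>
#include <stdexcept>
#include "points.hpp"

/* Return the smallest Euclidean distance from x to any point in p. */

double Points::Min_Distance(const std::vector <double> &x) const
{
  double best, d, diff;
  size_t i, j;

  if (p.empty() || x.size() != p[0].size()) {
    throw std::runtime_error("Min_Distance: dimension mismatch.");
  }

  best = -1;
  for (i = 0; i < p.size(); i++) {
    d = 0;                                  /* Squared distance to point i. */
    for (j = 0; j < x.size(); j++) {
      diff = x[j] - p[i][j];
      d += diff * diff;
    }
    if (best < 0 || d < best) best = d;     /* Track the smallest. */
  }
  return sqrt(best);                        /* One sqrt at the very end. */
}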

I've got a really simple program in src/points_stdin.cpp, which takes a file of points as a command line argument, and then reads single points on standard input. For each point on standard input, it prints the minimum distance to a point in the file. I'll show a few files, because that will help demonstrate the optimization later. The first is txt/1d_10.txt, which is a file containing ten random points in 1D space:

UNIX> cat txt/1d_10.txt
6.6224292 
9.4497210 
8.9526480 
9.0860032 
5.9090157 
1.7412893 
3.3581418 
0.3891034 
8.7371743 
9.4515564 
UNIX> 
I have graphed these points below, and I show the minimum distance from each integer point to the nearest of these points in the graph:

[Graph: the ten 1D points, with the minimum distance at each integer point]

You can eyeball these points and do the optimization problem yourself -- it's clearly the midpoint between 3.3581418 and 5.9090157, which is (3.3581418 + 5.9090157) / 2 = 4.633579, whose distance from either point is 1.275437. Let's verify these things with bin/points_stdin:

UNIX> sh -c '(for i in 1 2 3 4 5 6 7 8 9 4.633579 ; do echo $i; done ) | bin/points_stdin txt/1d_10.txt'
0.610897
0.258711
0.358142
0.641858
0.909016
0.0909843
0.377571
0.737174
0.047352
1.27544
UNIX> 
Here are five more data files: txt/2d_10.txt, txt/2d_100.txt, txt/2d_1000.txt, txt/3d_1000.txt and txt/9d_1000.txt. So that you have a visual idea of the 2D data sets, here are some heatmaps, where red means greater values of Min_Distance and white means smaller values. (Of course, these were created with jgraph using the Points class -- the source is src/points_2d_jgraph.cpp.)

[Heatmaps: 2d_10.txt, 2d_100.txt, 2d_1000.txt]


Grid Search

Since it was simple, I wrote a brute-force grid search, where you give the program a granularity, and it enumerates all points with that granularity. The enumeration loop is a fun one, if you've never written one like this (or more amazingly, if you've never seen me use break in a C++ program). It's in src/points_brute.cpp:

  /* Run through all of the points and find the one that optimizes the function. */

  max = 0;
  x.resize(points.p[0].size(), 0);
  done = false;

  while (!done) {
    d = points.Min_Distance(x);
    if (d > max) {
      max = d;
      best = x;
    }
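    /* Odometer-style increment: add the granularity to coordinate i.  If it
       stays in bounds, that's the next point to test.  If it overflows past
       10, reset it to zero and "carry" into the next coordinate. */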
    for (i = 0; i < x.size(); i++) {
      x[i] += granularity;
      if (x[i] <= 10) break;
      x[i] = 0;
    }
 
    done = (i == x.size());
  }
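By the way, it's easy to count the work in that loop: each coordinate takes on floor(10/g) + 1 values for granularity g, so an n-dimensional search evaluates roughly (10/g + 1)^n points. That's why the timings below explode as you shrink the granularity or add dimensions.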

Now we can estimate the maximum point for each of the data sets. The amount of time and the precision of the search depend on the granularity and the data set. That tiny 1D dataset takes a blink of an eye, and the 2D data sets don't take long:

Dataset      Granularity  Time     Maximum   Best Point
1d_10.txt    10^-7        2.3 s    1.27544   [4.63358]
2d_10.txt    10^-1        0.004 s  3.36888   [7.6, 3.8]
2d_10.txt    10^-2        0.024 s  3.39722   [7.64, 3.79]
2d_10.txt    10^-3        1.36 s   3.40008   [7.646, 3.783]
2d_10.txt    10^-4        2m 18s   3.40047   [7.6461, 3.7835]
2d_100.txt   10^-2        0.141 s  1.48968   [5.51, 5.84]
2d_100.txt   10^-3        12.2 s   1.49261   [5.513, 5.837]
2d_1000.txt  10^-2        1.22 s   0.550313  [7.30, 4.56]
2d_1000.txt  10^-3        2m 4s    0.551108  [7.295, 4.561]

For the 3D data set, it starts to take a long time, even for relatively coarse granularities. Moreover, the results don't simply refine the points as they did above:

Dataset      Granularity  Time    Maximum   Best Point
3d_1000.txt  0.10         1.28 s  1.37988   [6.40, 2.40, 7.30]
3d_1000.txt  0.09         1.64 s  1.39828   [6.22, 2.44, 7.39]
3d_1000.txt  0.08         2.30 s  1.39613   [6.60, 2.34, 7.24]

Those last two rows are troublesome -- first, you'll note that two of the coordinates for 0.08 are more than 0.1 units away from the coordinates for 0.09. That's what I mean when I say that the finer granularities don't simply "refine" the coarser ones. Worse, the maximum for a granularity of 0.08 is lower than the maximum for a granularity of 0.09. Here are the maxima and timings for all granularities from 0.10 down to 0.01:

[Graph: maxima and timings for granularities from 0.10 down to 0.01]

That 9D data set is a true disaster. Using a granularity of one takes 37.5 minutes!

UNIX> time bin/points_brute 1 < txt/9d_1000.txt
Best x: 1.00 1.00 9.00 9.00 1.00 4.00 1.00 1.00 1.00
Max: 8.25241

real	37m27.778s
user	37m25.628s
sys	0m1.027s
UNIX>

A Monte Carlo Search

Monte Carlo searches are nothing exciting -- just choose random points and keep track of the maximum. The code for this is in src/points_monte.cpp. I don't show any of it, because it's so simple, but there's a quick sketch below if you want the idea. While Monte Carlo searches are good for quick and dirty tests, and while they are very easy to parallelize, they are usually best used as starting points for figuring out how to do a search smarter. I did some Monte Carlo runs on the 3D data set; the results weren't much different from the grid search, except that it is easier to scale these tests in time.
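Here is that sketch -- mine and simplified, so the real src/points_monte.cpp may well differ (for one thing, the real program takes the number of iterations and an RNG seed on the command line, where here they are hard-wired):

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <vector>
#include "points.hpp"

int main()
{
  Points points;
  std::vector <double> x, best;
  double d, max;
  long long it, iterations;
  size_t i;

  iterations = 1000000;          /* Hard-wired here; command line arguments in the real program. */
  srand48(1);
  points.Read(std::cin);         /* Like the other programs, the points come on standard input. */

  max = 0;
  x.resize(points.p[0].size());
  for (it = 0; it < iterations; it++) {
    for (i = 0; i < x.size(); i++) x[i] = drand48() * 10.0;   /* Random point in [0,10)^n. */
    d = points.Min_Distance(x);
    if (d > max) {                                            /* Keep track of the maximum. */
      max = d;
      best = x;
    }
  }

  printf("Best x:");
  for (i = 0; i < best.size(); i++) printf(" %lg", best[i]);
  printf("\nMax: %lg\n", max);
  return 0;
}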

In the 9D data set, I set the parameters so that the Monte Carlo search would take about 37 minutes. It didn't come close to the grid search:

UNIX> time bin/points_monte  372000000 1 < txt/9d_1000.txt 
Best x: 5.87737 8.88675 1.33628 1.07946 1.59121 8.88738 1.28253 8.66066 2.0708
Max: 7.7405

real	36m47.246s
user	36m44.227s
sys	0m1.355s
UNIX> 

Using Bayesopt

Getting our problem to use BayesOpt is straightforward. The code is in src/points_bayes.cpp. First, I need to implement a function with the appropriate prototype for BayesOpt to minimize. Here it is:

bool Print = false;      /* If true, then min_function() will print what it's doing. */

/* To use BayesOpt, you need to define a testing function with their prototype: */

double min_function(unsigned int n,      // Number of dimensions
                    const double *x,     // The point to test - an array of n doubles
                    double *gradient,    // We don't use this (see below).
                    void *func_data)     // You use this to pass data to the function. 
                                         // In our case, it will be a pointer to the Points class
{
  Points *p;
  vector <double> xv;
  unsigned int i;
  double rv;

  gradient = NULL;                        // We don't compute a gradient; this assignment just marks the parameter as used.

  /* My code simply converts the data from their format into a Min_Distance() call. */

  p = (Points *) func_data;
  for (i = 0; i < n; i++) xv.push_back(x[i]);
  rv = p->Min_Distance(xv);
  if (Print) {
    for (i = 0; i < n; i++) printf(" %lg", x[i]);
    printf("  %lg\n", rv);
  }
  return -rv;      /* I return the negation, because the optimizer does minimization. */
}

Then, I have to set up their parameters and make a call to the procedure bayes_optimization().

int main(int argc, char **argv)
{
  Points points;
  double *lower, *upper, *x, *minf;
  long long seed;
  long long iterations;
  bopt_params params;
  size_t i;

  /* Skipped code that processes parameters and reads in the points. */

  .......

  /* Set the parameters of the Bayesian Optimizer */

  params = initialize_parameters_to_default();
  params.n_iterations = iterations;
  params.random_seed = seed;
  params.verbose_level = 0;
  lower = (double *) malloc(sizeof(double) * points.p[0].size());
  upper = (double *) malloc(sizeof(double) * points.p[0].size());
  x = (double *) malloc(sizeof(double) * points.p[0].size());
  minf = (double *) malloc(sizeof(double) * points.p[0].size());

  for (i = 0; i < points.p[0].size(); i++) {
    lower[i] = 1.0;
    upper[i] = 9.0;
  }

  bayes_optimization((int) points.p[0].size(),
                     min_function,
                     (void *) &points,
                     lower,
                     upper,
                     x,
                     minf,
                     params);

  printf("Best x:");
  for (i = 0; i < points.p[0].size(); i++) printf(" %lg", x[i]); 
  printf("\n");
  printf("Max: %lg\n", -minf[0]);
  return 0;
}

To compile, you need to specify BayesOpt's include directory and compiled libraries:

UNIX> make bin/points_bayes
g++ -c -o obj/points_bayes.o -O3 -Wall -Wextra -I/Users/plank/src/repos/bayesopt/include -Iinclude src/points_bayes.cpp
g++ -o bin/points_bayes obj/points.o obj/points_bayes.o /Users/plank/src/repos/bayesopt/lib/libbayesopt.a /Users/plank/src/repos/bayesopt/lib/libnlopt.a -lm
UNIX> 
My main() allows you to set the number of iterations, the RNG seed, and whether or not to print each call to min_function():
UNIX> bin/points_bayes
usage: points_bayes iterations seed print(t/f) - points on stdin
UNIX> 
The default number of iterations that BayesOpt uses is 190, so let's try that here on the 2D data sets:
UNIX> time bin/points_bayes 190 1 f < txt/2d_10.txt
Best x: 7.64349 3.78387
Max: 3.39845                   # Our grid search yielded 3.40047 in two minutes.

real	0m1.140s
user	0m1.131s
sys	0m0.007s
UNIX> time bin/points_bayes 190 1 f < txt/2d_100.txt
Best x: 5.53086 5.84168
Max: 1.48764                   # Our grid search yielded 1.49261 in 12 seconds

real	0m2.514s
user	0m2.504s
sys	0m0.008s
UNIX> time bin/points_bayes 190 1 f < txt/2d_1000.txt
Best x: 3.44244 8.84877
Max: 0.458114                  # Our grid search yielded 0.551108 in two minutes.

real	0m3.188s
user	0m3.177s
sys	0m0.008s
UNIX> 
That last result looks pretty disappointing. Similarly, I get 1.3295 with 3d_1000.txt, as opposed to over 1.40 with the grid search. I suspect that the optimization is being caught in the large number of local optima that arise with all of the points in space. Delving into the documentation, I see that there are some variables that I can tweak. That sure feels like a rabbit hole, but one parameter that has some promise is n_init_samples. This is the number of initial sampling calls used to seed the optimizer before the Bayesian machinery takes over. The default is 10 -- let's bump it up to 100 and see if getting some more variety when seeding the search yields more success (I'm bumping the total iterations up as well):
UNIX> bin/points_bayes_2 290 1 100 f < txt/2d_1000.txt 
Best x: 7.31019 4.56059
Max: 0.548909
UNIX> bin/points_bayes_2 290 1 100 f < txt/3d_1000.txt 
Best x: 6.61776 2.36002 7.21627
Max: 1.40403
UNIX> bin/points_bayes_2 290 1 100 f < txt/9d_1000.txt 
Best x: 8.99409 1.00029 1.01946 1.00492 7.79535 1.00081 1 1.0075 8.99393
Max: 7.73639
UNIX>
That's much better, isn't it? Moreover, that last call took 1.5 minutes to yield a result similar to the 37-minute Monte Carlo run. And it got the result with 290 calls to min_function(), as opposed to 372 million calls by the Monte Carlo run.
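For reference, the only substantive change in points_bayes_2 is one extra line when setting up the parameters. Here, init_iterations is my name for the new command-line argument, and I'm assuming the parameter field is n_init_samples -- check the BayesOpt headers if that doesn't compile:

  params = initialize_parameters_to_default();
  params.n_iterations = iterations;
  params.n_init_samples = init_iterations;   /* New: seed the search with more initial samples. */
  params.random_seed = seed;
  params.verbose_level = 0;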

Obviously, there are a lot of things to explore with this type of optimization, and I'm not going to probe any more with this set of lecture notes. My goal was to introduce you to the concept, so that you can draw upon it if you need to later in your lives.