Ruben Martinez-Cantin, "BayesOpt: A Bayesian Optimization Library for Nonlinear Optimization, Experimental Design and Bandits." Journal of Machine Learning Research, 15 (Nov): pp. 3735-3739, 2014.
My hatred of Python precludes me from exploring the options there, but there are many; scikit-optimize is one that my colleagues have used. This description of Bayesian optimization is not only a good one, but it also has some Jupyter notebooks that you can play along with, if that's your jam.
Bayesian optimization tries to improve this approach by shoving some math behind it. I'll summarize it as follows (go ahead and read the Wikipedia page if you're interested in learning more): the optimizer treats the data that you have so far as samples of an unknown function, and fits a model of that function to them. Using the model, the optimizer can predict what parameters will minimize the function. It tells you those parameters, and you evaluate them with your real function. It takes that answer, and refines its model.
I won't steal pictures from other web sites -- just look at the ones above.
The second is the selection algorithm: how do you choose the next set of parameters to test? The popular one here is called "Expected Improvement". With the selection algorithm, you would like a blend of "exploration/exploitation", meaning that the algorithm should try to refine the parameters that it already knows are good (exploitation), but also generate parameters that may be bad, but are from unknown regions of the parameter space, that might be a lot better than the ones we know (exploration). Typically, there are parameters that you can set to affect this tradeoff.
For my demonstration, my goal is to present a function that is intuitive, clearly non-linear, but not mathematically complicated. Here it is. My function is defined by a set of points, P, in a bounded n-dimensional coordinate system. The coordinate system is bounded in that all points have coordinates that are between 0 and 10 (inclusive). The function takes a point with n coordinates (x0, x1, ..., xn-1), and it returns the shortest distance from that point to any point in P. Our goal is to maximize this function over all points whose coordinates are between 1 and 9 (I'd do 0 and 10, but that tends to have maxima that are on the border).
It's easy to concoct a scenario where this function is important: There are a bunch of pollution sources in a city, and you want to build a house that is the farthest from a pollution source. You can also solve this problem by better means, but since we're focusing on optimization, we are treating the function as a black box.
I have a quick and dirty implementation here, in include/points.hpp and src/points.cpp. I'll only show the header. This is a pretty simple data structure:
#pragma once
#include <vector>
#include <istream>

/* This is a data structure to help demonstrate Bayesian Optimization. See the
   accompanying lecture notes. */

class Points {
  public:

    /* The only piece of data is a vector of points. Each element of the vector is
       a vector of n coordinates, and each coordinate will be a double between 0 and
       10. I'm making it public, because I'm not in the mood to get all object
       oriented about this. */

    std::vector < std::vector <double> > p;

    /* You'll note -- no constructors, destructors or assignment overloads. Since the
       only piece of data is that vector of vectors, all the defaults will work. */

    /* Reads from a stream and double-checks that everything is correct. If it
       encounters an error, it will throw an exception of type std::runtime_error. */

    void Read(std::istream &in);

    /* Here is the function that we want to maximize. Given a point in this
       n-dimensional space, what is the smallest distance to a point in p?
       If x's size is not equal to the number of dimensions of p, it will throw
       an exception of type std::runtime_error. */

    double Min_Distance(const std::vector <double> &x) const;
};
I've got a really simple program in src/points_stdin.cpp, which takes a file of points as a command line argument, and then single points on standard input. For each point on standard input, it prints the minimum distance to a point in the file. I'll show a few files, because that will help demonstrate the optimization later. The first is txt/1d_10.txt, which is a file containing ten random points in 1D space:
UNIX> cat txt/1d_10.txt
6.6224292
9.4497210
8.9526480
9.0860032
5.9090157
1.7412893
3.3581418
0.3891034
8.7371743
9.4515564
UNIX>

I have graphed these points below, and I show the minimum distance from each integer point to the nearest of these points in the graph:
[Figure: minimum distance to the nearest point of txt/1d_10.txt, at each integer x]
You can eyeball these points and do the optimization problem yourself -- it's clearly the midpoint between 3.3581418 and 5.9090157: 4.633579, whose distance is 1.275437 from either point. Let's verify these things with bin/points_stdin:
UNIX> sh -c '(for i in 1 2 3 4 5 6 7 8 9 4.633579 ; do echo $i; done ) | bin/points_stdin txt/1d_10.txt'
0.610897
0.258711
0.358142
0.641858
0.909016
0.0909843
0.377571
0.737174
0.047352
1.27544
UNIX>

Here are five more data files:
[Figures: 2d_10.txt, 2d_100.txt, 2d_1000.txt]
/* Run through all of the points in the grid and find the one that maximizes the function. */

  max = 0;
  x.resize(points.p[0].size(), 0);
  done = false;

  while (!done) {
    d = points.Min_Distance(x);
    if (d > max) {
      max = d;
      best = x;
    }

    /* Treat x like a mixed-radix counter: add the granularity to the first
       coordinate, and carry into the next coordinate whenever one passes 10. */

    for (i = 0; i < x.size(); i++) {
      x[i] += granularity;
      if (x[i] <= 10) break;
      x[i] = 0;
    }
    done = (i == x.size());
  }
Now we can estimate the maximum point for each of the data sets. The amount of time and the precision of the search depend on the granularity and the data set. That tiny 1D data set takes a blink of an eye, and the 2D data sets don't take long:
| Dataset | Granularity | Time | Maximum | Points |
|---------|-------------|------|---------|--------|
| 1d_10.txt | 10^-7 | 2.3 s | 1.27544 | [4.63358] |
| 2d_10.txt | 10^-1 | 0.004 s | 3.36888 | [7.6, 3.8] |
| 2d_10.txt | 10^-2 | 0.024 s | 3.39722 | [7.64, 3.79] |
| 2d_10.txt | 10^-3 | 1.36 s | 3.40008 | [7.646, 3.783] |
| 2d_10.txt | 10^-4 | 2m 18s | 3.40047 | [7.6461, 3.7835] |
| 2d_100.txt | 10^-2 | 0.141 s | 1.48968 | [5.51, 5.84] |
| 2d_100.txt | 10^-3 | 12.2 s | 1.49261 | [5.513, 5.837] |
| 2d_1000.txt | 10^-2 | 1.22 s | 0.550313 | [7.30, 4.56] |
| 2d_1000.txt | 10^-3 | 2m 4s | 0.551108 | [7.295, 4.561] |
For the 3D data set, the search starts to take a long time, even at relatively coarse granularities. Moreover, the finer granularities don't simply refine the points as they did above:
| Dataset | Granularity | Time | Maximum | Points |
|---------|-------------|------|---------|--------|
| 3d_1000.txt | 0.10 | 1.28 s | 1.37988 | [6.40, 2.40, 7.30] |
| 3d_1000.txt | 0.09 | 1.64 s | 1.39828 | [6.22, 2.44, 7.39] |
| 3d_1000.txt | 0.08 | 2.30 s | 1.39613 | [6.60, 2.34, 7.24] |
Those last two rows are troublesome -- first, you'll note that two of the coordinates for granularity 0.08 are more than 0.1 units away from the coordinates for 0.09. That's what I mean by the finer granularities not "refining" the coarser ones. Worse, the maximum for a granularity of 0.08 is lower than the maximum for 0.09. Here are the maxima and timings for all granularities from 0.10 down to 0.01:
[Figure: maxima and timings for granularities from 0.10 down to 0.01]
That 9D data set is a true disaster. Using a granularity of one takes 37.5 minutes!
UNIX> time bin/points_brute 1 < txt/9d_1000.txt
Best x: 1.00 1.00 9.00 9.00 1.00 4.00 1.00 1.00 1.00
Max: 8.25241

real	37m27.778s
user	37m25.628s
sys	0m1.027s
UNIX>
In the 9D data set, I set the parameters so that the Monte Carlo search would take about 37 minutes. It didn't come close to the grid search:
UNIX> time bin/points_monte 372000000 1 < txt/9d_1000.txt
Best x: 5.87737 8.88675 1.33628 1.07946 1.59121 8.88738 1.28253 8.66066 2.0708
Max: 7.7405

real	36m47.246s
user	36m44.227s
sys	0m1.355s
UNIX>
bool Print = false;   /* If true, then min_function() will print what it's doing. */

/* To use BayesOpt, you need to define a testing function with their prototype: */

double min_function(unsigned int n,     // Number of dimensions
                    const double *x,    // The point to test - an array of n doubles
                    double *gradient,   // Just set this to NULL.
                    void *func_data)    // You use this to pass data to the function.
                                        // In our case, it will be a pointer to the Points class.
{
  Points *p;
  vector <double> xv;
  unsigned int i;
  double rv;

  gradient = NULL;    // We don't use this, so set it to NULL.

  /* My code simply converts the data from their format into a Min_Distance() call. */

  p = (Points *) func_data;
  for (i = 0; i < n; i++) xv.push_back(x[i]);
  rv = p->Min_Distance(xv);

  if (Print) {
    for (i = 0; i < n; i++) printf(" %lg", x[i]);
    printf(" %lg\n", rv);
  }

  return -rv;   /* I return the negation, because the optimizer does minimization. */
}
Then, I have to set up their parameters and make a call to the procedure bayes_optimization().
int main(int argc, char **argv)
{
  Points points;
  double *lower, *upper, *x, *minf;
  long long seed;
  long long iterations;
  bopt_params params;
  size_t i;

  /* Skipped code that processes parameters and reads in the points. */
  .......

  /* Set the parameters of the Bayesian Optimizer. */

  params = initialize_parameters_to_default();
  params.n_iterations = iterations;
  params.random_seed = seed;
  params.verbose_level = 0;

  /* Allocate the bounds arrays and the arrays that hold the answer. */

  lower = (double *) malloc(sizeof(double) * points.p[0].size());
  upper = (double *) malloc(sizeof(double) * points.p[0].size());
  x = (double *) malloc(sizeof(double) * points.p[0].size());
  minf = (double *) malloc(sizeof(double) * points.p[0].size());

  for (i = 0; i < points.p[0].size(); i++) {
    lower[i] = 1.0;
    upper[i] = 9.0;
  }

  bayes_optimization((int) points.p[0].size(),  // Number of dimensions.
                     min_function,              // The function to minimize.
                     (void *) &points,          // Passed through to min_function().
                     lower,                     // Lower bounds on each coordinate.
                     upper,                     // Upper bounds on each coordinate.
                     x,                         // Where the best point goes.
                     minf,                      // Where the minimum value goes.
                     params);

  printf("Best x:");
  for (i = 0; i < points.p[0].size(); i++) printf(" %lg", x[i]);
  printf("\n");
  printf("Max: %lg\n", -minf[0]);

  return 0;
}
To compile, you need to specify BayesOpt's include directory and compiled libraries:
UNIX> make bin/points_bayes
g++ -c -o obj/points_bayes.o -O3 -Wall -Wextra -I/Users/plank/src/repos/bayesopt/include -Iinclude src/points_bayes.cpp
g++ -o bin/points_bayes obj/points.o obj/points_bayes.o /Users/plank/src/repos/bayesopt/lib/libbayesopt.a /Users/plank/src/repos/bayesopt/lib/libnlopt.a -lm
UNIX>

My main() allows you to set the number of iterations, the RNG seed, and whether or not to print each call to min_function():
UNIX> bin/points_bayes
usage: points_bayes iterations seed print(t/f) - points on stdin
UNIX>

The default number of iterations that BayesOpt uses is 190, so let's try that here on the 2D data sets:
UNIX> time bin/points_bayes 190 1 f < txt/2d_10.txt
Best x: 7.64349 3.78387
Max: 3.39845                 # Our grid search yielded 3.40047 in two minutes.

real	0m1.140s
user	0m1.131s
sys	0m0.007s
UNIX> time bin/points_bayes 190 1 f < txt/2d_100.txt
Best x: 5.53086 5.84168
Max: 1.48764                 # Our grid search yielded 1.49261 in 12 seconds.

real	0m2.514s
user	0m2.504s
sys	0m0.008s
UNIX> time bin/points_bayes 190 1 f < txt/2d_1000.txt
Best x: 3.44244 8.84877
Max: 0.458114                # Our grid search yielded 0.551108 in two minutes.

real	0m3.188s
user	0m3.177s
sys	0m0.008s
UNIX>

That last result looks pretty disappointing. Similarly, I get 1.3295 with 3d_1000.txt, as opposed to over 1.40 with the grid search. I suspect that the optimizer is being caught in the large number of local minima that arise from all of the points in space. Delving into the documentation, I see that there are some variables that I can tweak. That sure feels like a rabbit hole, but one parameter that has some promise is n_init_iterations. This is the number of initial calls with fixed values that seed the optimizer's model before the search proper begins. The default is 10 -- let's bump it up to 100 and see if more variety when seeding the search yields more success (I'm bumping the total iterations up as well):
UNIX> bin/points_bayes_2 290 1 100 f < txt/2d_1000.txt
Best x: 7.31019 4.56059
Max: 0.548909
UNIX> bin/points_bayes_2 290 1 100 f < txt/3d_1000.txt
Best x: 6.61776 2.36002 7.21627
Max: 1.40403
UNIX> bin/points_bayes_2 290 1 100 f < txt/9d_1000.txt
Best x: 8.99409 1.00029 1.01946 1.00492 7.79535 1.00081 1 1.0075 8.99393
Max: 7.73639
UNIX>

That's much better, isn't it? Moreover, that last call took 1.5 minutes to yield a result similar to the 37-minute Monte Carlo run. And it got the result with 290 calls to min_function(), as opposed to 372 million calls by the Monte Carlo run.
Obviously, there are a lot of things to explore with this type of optimization, and I'm not going to probe any more with this set of lecture notes. My goal was to introduce you to the concept, so that you can draw upon it if you need to later in your lives.