I know that's a bit vague, but that's the general principle. The running time is straightforward, too: At each step of your problem, you're throwing away half of the items, so after log(n) steps, you'll be left with one item, and you're done.
That means your binary search is generally O(log n), which is smoking fast. I say "generally", because sometimes you need to do extra stuff during a step, and that can increase your overhead.
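To make the "log(n) steps" claim concrete, here is a tiny sketch that counts how many times you can throw away half of n items before one remains. The function name halving_steps() is just something I made up for illustration:

```cpp
#include <cassert>

/* Count how many times n items can be halved (rounding down)
   before only one item is left.  This is floor(log2(n)). */
int halving_steps(long long n)
{
  int steps = 0;

  assert(n >= 1);          /* Halving zero items makes no sense. */
  while (n > 1) {
    n /= 2;                /* Throw away half of the items. */
    steps++;
  }
  return steps;
}
```

For a dictionary of 24,853 words, halving_steps(24853) is 14, so a binary search needs roughly 15 probes rather than tens of thousands.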
UNIX> wc txt/words.txt
24853 24853 205393 txt/words.txt          # There are 24,853 words
UNIX> head -n 5 txt/words.txt             # It looks sorted from the beginning
aaa
aaas
aarhus
aaron
aau
UNIX> tail -n 5 txt/words.txt             # And end
zoroastrian
zounds
zucchini
zurich
zygote
UNIX> sort txt/words.txt > tmp.txt        # We demonstrate that it is indeed sorted.
UNIX> diff txt/words.txt tmp.txt
UNIX>
Let's write a program, in src/dict_bsearch.cpp. It's a straightforward binary search -- please read the comments inline for explanation.
/* src/dict_bsearch.cpp. This reads a dictionary of words from a file into a vector of strings.
It sorts the vector, if not already sorted, and then it reads words from standard input
It uses binary search to determine whether each of these words is in the dictionary. */
#include <cstdio>
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <fstream>
using namespace std;
class Bsearch {
public:
void Create(const string &filename); // Create the vector from the file.
bool Find(const string &word) const; // Return whether a word is in the vector.
protected:
vector <string> words;
};
/* Create() is straightforward -- it reads each word into a vector, and while doing so,
determines whether the vector is sorted. If it is not, then it is sorted at the
end of Create(). */
void Bsearch::Create(const string &filename)
{
ifstream fin;
bool sorted;
string w;
sorted = true;
fin.open(filename.c_str());
if (fin.fail()) throw (string) "Could not open " + filename;
while (fin >> w) {
if (words.size() > 0 && w < words[words.size()-1]) sorted = false;
words.push_back(w);
}
if (!sorted) sort(words.begin(), words.end());
}
/* Here's the binary search. It keeps track of three variables:
l = the index of the smallest word that we are considering.
h = the index of the largest word that we are considering.
m = the middle of l and h
It iterates by looking at words[m], and using that value to either return,
discard the lower half of elements or discard the higher half of elements.
*/
bool Bsearch::Find(const string &word) const
{
int l, h, m;
if (words.size() == 0) return false;
l = 0; // Initially, we consider the entire vector
h = words.size() - 1;
while (l <= h) {
m = (l + h) / 2;
// printf("l:%d(%s) m:%d(%s) h:%d(%s)\n",
// l, words[l].c_str(), m, words[m].c_str(), h, words[h].c_str());
if (words[m] == word) return true;
if (words[m] > word) h = m-1; // Throw away the top half
if (words[m] < word) l = m+1; // Throw away the bottom half
}
return false;
}
/* The main() is straightforward -- create the dictionary from the file, then find each
word on standard input. */
/* I'm not including it here in the lecture notes. */
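If you don't want to write the loop yourself, the standard library has binary search built in. Here's a minimal sketch using lower_bound() from <algorithm>. The helper name find_index() is mine, just for illustration -- it is not part of src/dict_bsearch.cpp, and it assumes the vector is already sorted:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
using namespace std;

/* Hypothetical helper: return the index of word in sorted_words, or -1 if it is not there. */
int find_index(const vector<string> &sorted_words, const string &word)
{
  vector<string>::const_iterator it;

  /* Sanity check for debug builds only.  It is O(n), so compile with -DNDEBUG when timing. */
  assert(is_sorted(sorted_words.begin(), sorted_words.end()));

  /* lower_bound() does a binary search for the first element that is >= word. */
  it = lower_bound(sorted_words.begin(), sorted_words.end(), word);
  if (it == sorted_words.end() || *it != word) return -1;
  return it - sorted_words.begin();
}
```

If all you need is yes/no, binary_search() from the same header takes the same arguments and returns a bool.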
Let's test by uncommenting the printf() statement inside Find(), and doing some small examples:
UNIX> wc txt/words-12.txt # My dictionary has 12 words
12 12 99 txt/words-12.txt
UNIX> cat txt/words-12.txt
attention
debtor
efficient
goldenseal
highwaymen
hogan
moth
rebutted
salsify
stud
wakeful
woodpeck
# I'm going to find attention, debtor, efficient and hogan, which are all there:
UNIX> echo attention | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:0(attention) m:2(efficient) h:4(highwaymen)
l:0(attention) m:0(attention) h:1(debtor)
attention: found
Found: 1 of 1
UNIX> echo debtor | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:0(attention) m:2(efficient) h:4(highwaymen)
l:0(attention) m:0(attention) h:1(debtor)
l:1(debtor) m:1(debtor) h:1(debtor)
debtor: found
Found: 1 of 1
UNIX> echo efficient | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:0(attention) m:2(efficient) h:4(highwaymen)
efficient: found
Found: 1 of 1
UNIX> echo hogan | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
hogan: found
Found: 1 of 1
# Now aaa, zzz and mmm, which are not there:
UNIX> echo aaa | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:0(attention) m:2(efficient) h:4(highwaymen)
l:0(attention) m:0(attention) h:1(debtor)
aaa: not-found
Found: 0 of 1
UNIX> echo zzz | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:6(moth) m:8(salsify) h:11(woodpeck)
l:9(stud) m:10(wakeful) h:11(woodpeck)
l:11(woodpeck) m:11(woodpeck) h:11(woodpeck)
zzz: not-found
Found: 0 of 1
UNIX> echo mmm | bin/dict_bsearch txt/words-12.txt y
l:0(attention) m:5(hogan) h:11(woodpeck)
l:6(moth) m:8(salsify) h:11(woodpeck)
l:6(moth) m:6(moth) h:7(rebutted)
mmm: not-found
Found: 0 of 1
UNIX>
It's a good idea to go over the examples above and look at the indices, to see how it narrows
the search space at each step. Let's look at a bigger example to see what happens when
it tries to find "jjj" in txt/words.txt. I'm going to have that print statement
print (h-l) at each step, so you can see how it roughly halves at each step:
UNIX> echo jjj | bin/dict_bsearch txt/words.txt y
h-l:24852 l:0(aaa) m:12426(jewelry) h:24852(zygote)
h-l:12425 l:12427(jewett) m:18639(refractory) h:24852(zygote)
h-l:6211 l:12427(jewett) m:15532(nightfall) h:18638(refractometer)
h-l:3104 l:12427(jewett) m:13979(mambo) h:15531(nightdress)
h-l:1551 l:12427(jewett) m:13202(legendary) h:13978(maltreat)
h-l:774 l:12427(jewett) m:12814(knapsack) h:13201(legend)
h-l:386 l:12427(jewett) m:12620(kamchatka) h:12813(knapp)
h-l:192 l:12427(jewett) m:12523(joyous) h:12619(kalmuk)
h-l:95 l:12427(jewett) m:12474(johns) h:12522(joyful)
h-l:46 l:12427(jewett) m:12450(joanna) h:12473(johnny)
h-l:22 l:12427(jewett) m:12438(jimenez) h:12449(joan)
h-l:10 l:12439(jimmie) m:12444(jitterbug) h:12449(joan)
h-l:4 l:12445(jitterbugger) m:12447(jittery) h:12449(joan)
h-l:1 l:12448(jive) m:12448(jive) h:12449(joan)
h-l:0 l:12449(joan) m:12449(joan) h:12449(joan)
jjj: not-found
Found: 0 of 1
UNIX>
I have implemented the set and unordered_set code in src/dict_set.cpp and src/dict_uset.cpp respectively. To test, I have created txt/test.txt, which has 12,000 words from txt/words.txt, and 12,000 words that are not in txt/words.txt. Let's see how they compare:
UNIX> make clean
rm -f bin/*
UNIX> make bin/dict_bsearch bin/dict_set bin/dict_uset
g++ -o bin/dict_bsearch -Wall -Wextra -std=c++11 src/dict_bsearch.cpp
g++ -o bin/dict_set -Wall -Wextra -std=c++11 src/dict_set.cpp
g++ -o bin/dict_uset -Wall -Wextra -std=c++11 src/dict_uset.cpp
UNIX> time bin/dict_bsearch txt/words.txt < txt/test.txt n
Found: 12000 of 24000
real 0m0.151s
user 0m0.147s
sys 0m0.003s
UNIX> time bin/dict_set txt/words.txt < txt/test.txt n
Found: 12000 of 24000
real 0m0.226s
user 0m0.220s
sys 0m0.004s
UNIX> time bin/dict_uset txt/words.txt < txt/test.txt n
Found: 12000 of 24000
real 0m0.089s
user 0m0.084s
sys 0m0.003s
UNIX>
Predictably, the unordered_set was the fastest. The binary search is significantly faster than the set, even though they have the same big-O. Part of that is because we don't sort the words (they are already sorted), but the significant savings actually come from memory. The set uses a tree data structure, which has a lot of pointers and extra memory. The binary search simply uses the vector.
Keep that in mind.
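I haven't shown src/dict_set.cpp or src/dict_uset.cpp, so here is a rough sketch of what the three kinds of lookup look like side by side. The function names are hypothetical, and this is my guess at the general shape rather than the contents of those files:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

/* Sorted vector: O(log n) per lookup, and the words sit in one contiguous array. */
size_t count_found_vector(const vector<string> &sorted_words, const vector<string> &queries)
{
  size_t found = 0;

  assert(is_sorted(sorted_words.begin(), sorted_words.end()));  /* Checked once per batch. */
  for (const string &q : queries) {
    if (binary_search(sorted_words.begin(), sorted_words.end(), q)) found++;
  }
  return found;
}

/* set: also O(log n) per lookup, but each probe follows a pointer to another tree node. */
size_t count_found_set(const set<string> &words, const vector<string> &queries)
{
  size_t found = 0;

  for (const string &q : queries) if (words.find(q) != words.end()) found++;
  return found;
}

/* unordered_set: expected O(1) per lookup, using a hash table. */
size_t count_found_uset(const unordered_set<string> &words, const vector<string> &queries)
{
  size_t found = 0;

  for (const string &q : queries) if (words.find(q) != words.end()) found++;
  return found;
}
```

All three return the same count for the same words and queries; only the data structure underneath differs.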
bool Bsearch::Find(const string &word) const
{
int start, size, mid;
start = 0;
size = words.size();
if (size == 0) return false;
while (size > 1) {
mid = start + size/2;
/* I guess I like how this code translates logically: */
if (words[mid] > word) { /* If word is not in the second half... */
size /= 2; /* Discard the second half. */
} else {
start += size/2; /* Otherwise discard the first half. */
size -= size/2; /* Note this handles even and odd sizes correctly. */
}
}
return (words[start] == word);
}
It's faster than the previous code (about 0.114 seconds here, compared to about 0.151 before):
UNIX> time bin/dict_bs_ss txt/words.txt < txt/test.txt n
Found: 12000 of 24000
real 0m0.114s
user 0m0.110s
sys 0m0.003s
UNIX>
Plus it has another advantage over the previous code -- if the vector contains duplicates, the search always ends at the last of the duplicates, whereas the previous code can stop at any one of them.
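To see the behavior with duplicates, here is a small variation of the start/size search that returns an index instead of a bool. The name find_last() is hypothetical; the loop is the same as in Find() above:

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

/* Return the index of the last copy of word in sorted_words, or -1 if it is absent. */
int find_last(const vector<string> &sorted_words, const string &word)
{
  int start, size, mid;

  start = 0;
  size = sorted_words.size();
  if (size == 0) return -1;

  while (size > 1) {
    mid = start + size/2;
    assert(mid < (int) sorted_words.size());   /* mid always stays inside the vector. */
    if (sorted_words[mid] > word) {
      size /= 2;                /* word is not in the second half, so discard it. */
    } else {
      start += size/2;          /* sorted_words[mid] <= word, so the last copy is at mid or later. */
      size -= size/2;
    }
  }
  return (sorted_words[start] == word) ? start : -1;
}
```

On the vector { "a", "b", "b", "b", "c" }, find_last() returns 3 for "b", because an equal element never causes the second half to be discarded.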
[Figure: the search range divided into a light blue region and a pink region.]
Your goal is to either eliminate the light blue region or the pink region. Sometimes that's pretty natural, as in the code above. Sometimes less so. The next section is an example.
Let me define a function f(v) whose answer is "yes/no". Let's suppose that f(v) is O(n). Moreover, suppose that there is a value vopt, such that for all v < vopt, f(v) is "no", and for all v ≥ vopt, f(v) is "yes". Then we can use binary search on v to find vopt. The running time is going to be O(n log v).
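Here is a minimal sketch of that idea in general form. The name first_yes() and its parameters are my own invention for illustration. It assumes that f is "no" below vopt and "yes" from vopt on, and that f(hi) is "yes", so an answer exists in [lo, hi]:

```cpp
#include <cassert>
#include <functional>
using namespace std;

/* Return the smallest v in [lo, hi] for which f(v) is true.
   Assumes f is false below some vopt, true from vopt on, and f(hi) is true. */
long long first_yes(long long lo, long long hi, const function<bool(long long)> &f)
{
  long long start, size, mid;

  start = lo;
  size = hi - lo + 1;             /* The candidates are start .. start+size-1. */
  assert(size >= 1);

  while (size > 1) {
    mid = start + size/2 - 1;     /* The last candidate in the first half. */
    if (f(mid)) {
      size = size/2;              /* The answer is at mid or before: keep the first half. */
    } else {
      start += size/2;            /* The answer is after mid: keep the second half. */
      size -= size/2;
    }
  }
  return start;
}
```

Each pass through the loop calls f once and halves the range, which is where the log factor in the running time comes from.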
A good example of this is Leetcode Medium problem #875: "Koko Eating Bananas". Here's the problem on Leetcode if you want to try it yourself: https://leetcode.com/problems/koko-eating-bananas/.
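Here's a summary of the problem: You are given a vector of piles of bananas, and a number of timesteps h. In each timestep, Koko picks one pile and eats up to k bananas from it; if the pile has fewer than k bananas, she finishes that pile and eats nothing more in that timestep. Your job is to return the smallest k that lets her finish every pile in at most h timesteps.
Example 1: piles = [3, 6, 7, 11] and h = 8. The answer is 4: pile 0 takes one timestep, piles 1 and 2 take two timesteps each, and pile 3 takes three.
That's 8 timesteps. You'll note that if you set k to 3, then it will take you 10 timesteps.
Example 2: piles = [30, 11, 23, 4, 20] and h = 5. The answer is 30, because you have to reduce each pile to zero in a single timestep.
Example 3: piles = [30, 11, 23, 4, 20] and h = 6. The answer is 23. That way, you get each of piles 1 through 4 in one timestep, and pile 0 in two timesteps.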
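If you want to check the arithmetic in those examples, here is a small sketch that computes the number of timesteps for a given k, plus the brute-force search over k. The function names are hypothetical, and this is not the code from src/Koko_Linear.cpp:

```cpp
#include <cassert>
#include <vector>
using namespace std;

/* Total timesteps needed to finish every pile when eating up to k bananas per timestep. */
long long timesteps_for(const vector<int> &piles, int k)
{
  long long total = 0;
  size_t i;

  assert(k > 0);                                  /* k = 0 would divide by zero. */
  for (i = 0; i < piles.size(); i++) {
    total += ((long long) piles[i] + k - 1) / k;  /* piles[i]/k, rounded up. */
  }
  return total;
}

/* Try k = 1, 2, ... up to the largest pile, and return the first k that fits in h timesteps.
   Returns -1 if no k works, which happens when h is smaller than the number of piles. */
int min_speed_linear(const vector<int> &piles, int h)
{
  int maxpile = 0;
  int k;
  size_t i;

  for (i = 0; i < piles.size(); i++) if (piles[i] > maxpile) maxpile = piles[i];
  for (k = 1; k <= maxpile; k++) {
    if (timesteps_for(piles, k) <= h) return k;
  }
  return -1;
}
```

With piles = [3, 6, 7, 11], timesteps_for() gives 10 for k = 3 and 8 for k = 4, so min_speed_linear() returns 4 when h is 8.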
Here's a good strategy for developing binary search solutions like this one. First, write the code to solve it with linear search. Here what we'll do is have k go from 1 to the maximum pile size, and print out whether we can solve the problem with k.
I have put that code into src/Koko_Linear.cpp, and we'll run it on the three examples above:
UNIX> g++ src/Koko_Linear.cpp
UNIX> echo 3 6 7 11 8 | a.out
k: 1 Timesteps: 27. Success: N
k: 2 Timesteps: 15. Success: N
k: 3 Timesteps: 10. Success: N
k: 4 Timesteps: 8. Success: Y         # So the answer will be 4.
k: 5 Timesteps: 8. Success: Y
k: 6 Timesteps: 6. Success: Y
k: 7 Timesteps: 5. Success: Y
k: 8 Timesteps: 5. Success: Y
k: 9 Timesteps: 5. Success: Y
k: 10 Timesteps: 5. Success: Y
k: 11 Timesteps: 4. Success: Y
0
UNIX> echo 30 11 23 4 20 5 | a.out
k: 1 Timesteps: 88. Success: N
....                                  # Skipping
k: 29 Timesteps: 6. Success: N
k: 30 Timesteps: 5. Success: Y        # The answer is 30.
0
UNIX> echo 30 11 23 4 20 6 | a.out
k: 1 Timesteps: 88. Success: N
....                                  # Skipping
k: 22 Timesteps: 7. Success: N
k: 23 Timesteps: 6. Success: Y        # The answer is 23.
k: 24 Timesteps: 6. Success: Y
....                                  # Skipping
UNIX>
Now, we want to turn that linear search into a binary search. We'll start by searching on numbers from 1 to the max_pile_size, so:
[Figure: the range of candidate values, with the first half shaded blue.]
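We need to decide whether we want to look at the element in start+size/2-1 or start+size/2. Let's think about what they tell us. The element in start+size/2-1 is the last element of the first half. If its answer is "yes", then the value we want is in the first half, and we can discard the second half. If its answer is "no", then the value we want is in the second half, and we can discard the first half. The element in start+size/2 is the first element of the second half. A "no" there lets us discard the first half, but a "yes" doesn't tell us which half to keep, because that element may itself be the answer. So we test start+size/2-1: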
int Solution::minEatingSpeed(vector<int>& piles, int h)
{
int start, size, maxpile, mid;
size_t i;
long long timesteps, for_pile;
/* Calculate the maximum pile size. */
maxpile = piles[0];
for (i = 0; i < piles.size(); i++) if (piles[i] > maxpile) maxpile = piles[i];
/* We want our range to start at one and end at maxpile (including maxpile).
So we set start to 1 and size to maxpile. Remember size means that start+size is one
past the last element in our region. */
start = 1;
size = maxpile;
while (size > 1) {
/* You want to test the highest value in the first half of the values
(the last value in the blue region of the picture). */
mid = start + size/2 - 1;
/* Timesteps is the total timesteps if k is set to mid. */
timesteps = 0;
for (i = 0; i < piles.size(); i++) {
for_pile = piles[i] / mid;
if (piles[i]%mid != 0) for_pile++;
timesteps += for_pile;
}
// printf("Start: %d. Size: %d Mid: %d. Timesteps: %lld\n", start, size, mid, timesteps);
/* If timesteps is too big, then you know that the answer
is in the second half of the range. You can throw out the
first half. */
if (timesteps > h) {
start += size/2;
size -= size/2;
/* Otherwise, the answer is in the first half of the range, so toss out the second half. */
} else {
size = size/2;
}
}
return start;
}
(If you care, that solution was pretty much smack at 50% in terms of speed on Leetcode).
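Here's my description: You are given a vector nums and a number p. You need to pick p pairs of elements from nums, using each element in at most one pair, so that the maximum difference within any pair is as small as possible. You return that maximum difference. For example:
nums = [10, 1, 2, 7, 1, 3 ]
p = 2
So we need to find two pairs and minimize the maximum difference within a pair. The answer here is 1 -- { (1, 1), (2, 3) }. The two differences are 0 and 1, so the maximum difference is 1. It's the best you can do, so that's the answer.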
To formulate this as a binary search problem, let's define f(v) as follows: f(v) is "yes" if you can find p pairs in nums, using each element at most once, such that the difference within every pair is ≤ v. Otherwise, f(v) is "no". If f(v) is "yes", then so is f(v+1), so there is a smallest value vopt for which f(v) is "yes", and that is our answer.
So -- continuing with the problem formulation: If you can implement f(v) in O(n) time (where n is the size of the vector), then you can use binary search to find vopt.
So we need to implement f(v). To do that we can sort the vector once, and then proceed greedily. Look at nums[0]. If (nums[1]-nums[0]) ≤ v, then we count it as a pair and move onto nums[2]. Otherwise, we ignore nums[0] and move onto nums[1]. You can prove to yourself that if there are p pairs, then this algorithm will find them. I won't do that formally, but you should give it some thought to convince yourself that this is true.
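Here is the greedy pass on its own, as a minimal sketch. The names count_pairs() and can_make_pairs() are hypothetical, and the vector is assumed to be sorted already:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>
using namespace std;

/* Greedily count disjoint adjacent pairs in sorted_nums whose difference is <= v. */
int count_pairs(const vector<int> &sorted_nums, int v)
{
  int num_pairs = 0;
  size_t i;

  assert(is_sorted(sorted_nums.begin(), sorted_nums.end()));
  for (i = 1; i < sorted_nums.size(); i++) {
    if (sorted_nums[i] - sorted_nums[i-1] <= v) {
      num_pairs++;
      i++;               /* Skip ahead, so that neither element is used in a second pair. */
    }
  }
  return num_pairs;
}

/* This is f(v): can we make at least p pairs whose differences are all <= v? */
bool can_make_pairs(const vector<int> &sorted_nums, int p, int v)
{
  return count_pairs(sorted_nums, v) >= p;
}
```

On the sorted example { 1, 1, 2, 3, 7, 10 }, count_pairs() gives 1 pair for v = 0, 2 pairs for v = 1 and 3 pairs for v = 3, so with p = 2 the first "yes" is at v = 1.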
Let's code it up. The Leetcode class/method is:
class Solution {
public:
int minimizeMax(vector <int> &nums, int p);
};
In src/Min_The_Max_Skeleton.cpp I have a skeleton that reads in the array and p and then calls minimizeMax(). As always, it compiles and runs, but not correctly:
UNIX> make bin/Min_The_Max_Skeleton
g++ -o bin/Min_The_Max_Skeleton -Wall -Wextra -std=c++11 src/Min_The_Max_Skeleton.cpp
UNIX> echo 10 1 2 7 1 3 2 | bin/Min_The_Max_Skeleton
0
UNIX>
We'll use the same strategy as before -- we know the answer needs to be a value between 0 and the maximum element, so let's test all of those with a linear loop. We do that in src/Min_The_Max_Linear.cpp. We'll run it on their example -- you'll see that the first answer that works is 1, which is the correct answer:
UNIX> g++ src/Min_The_Max_Linear.cpp
UNIX> echo 10 1 2 7 1 3 2 | ./a.out
ans: 0. num_pairs: 1. Success: N
ans: 1. num_pairs: 2. Success: Y      # This is the right answer.
ans: 2. num_pairs: 2. Success: Y
ans: 3. num_pairs: 3. Success: Y
ans: 4. num_pairs: 3. Success: Y
ans: 5. num_pairs: 3. Success: Y
ans: 6. num_pairs: 3. Success: Y
ans: 7. num_pairs: 3. Success: Y
ans: 8. num_pairs: 3. Success: Y
ans: 9. num_pairs: 3. Success: Y
ans: 10. num_pairs: 3. Success: Y
0
UNIX>
This is the same kind of problem as above, and as before, we can use the element in start+size/2-1 to be the one that we test in our binary search:
int Solution::minimizeMax(vector <int> &nums, int p)
{
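  int ans;                // The value that we are testing at each step
  int max;
  size_t i;               // size_t avoids a signed/unsigned comparison with nums.size()
  int num_pairs;
  int start, size;

  /* Sort once, so that each test is a single greedy pass over the vector. */
  sort(nums.begin(), nums.end());
  max = nums[nums.size()-1];
  start = 0;
  size = max+1; // This is because the range of answers is [0,max].
  while (size > 1) {
    /* Test the last value in the first half of the range. */
    ans = start+size/2-1;
    /* Greedily count disjoint adjacent pairs whose difference is <= ans. */
    num_pairs = 0;
    for (i = 1; i < nums.size(); i++) {
      if (nums[i] - nums[i-1] <= ans) {
        num_pairs++;
        i++;              // Skip ahead, so that nums[i] isn't used in a second pair.
      }
    }
    /* If there are enough pairs, the answer is in the first half. Otherwise, it's in the second. */
    if (num_pairs >= p) {
      size /= 2;
    } else {
      start += size/2;
      size -= size/2;
    }
  }
  return start;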
}
/* Read in nums and p, and call the method. */
We can prove to ourselves that it works with the examples: