Binary Search Reconsidered

   

Binary Search Reconsidered - 2011/04/18

"Binary Search" was made popular as an interesting test problem by Jon Bentley in his book Programming Pearls. There he says that it's an interesting problem, and baits the hook by saying:

"I've assigned this problem in courses at Bell Labs and IBM.

...

"Professional programmers had a couple of hours to convert the above description into a program in the language of their choice ... At the end of the specified time, almost all the programmers reported that they had correct code for the task ... ninety percent of the programmers found bugs in their programs

...

"I was amazed: given ample time, only about ten percent of professional programmers were able to get this small program right."

And so the problem has attained a certain status.

I'm not going to examine all the issues here. There is a superb series of blog posts by Mike Taylor that can be found starting here:

(Note: there are a couple of extra posts interspersed ...)

Some people object to being made to examine this problem. They say it's not relevant to modern programming, and that very few people are called upon to write algorithmic code from scratch. To say that, I believe, is to miss the point. The issue is addressed eloquently in one of the comments. Further, some of the comments are quite vocal about how the ideas of "proving code" interact with those of "testing code."

These questions are covered quite comprehensively in the posts and other comments.

The thing is, the problem is subtle. Again quoting from Bentley:

" ... in the history in Section 6.2.1 of his Sorting and Searching, Knuth points out that while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962."
  • Jon Bentley, Programming Pearls (1st edition), pp. 35-36

Even then, as recently as 2006 the question arose again:

"Nearly All Binary Searches and Mergesorts are Broken

...

"I remember vividly Jon Bentley's first Algorithms lecture at CMU, where he asked all of us incoming Ph.D. students to write a binary search, and then dissected one of our implementations in front of the class. Of course it was broken, as were most of our implementations. This made a real impression on me, as did the treatment of this material in his wonderful Programming Pearls (Addison-Wesley, 1986; Second Edition, 2000). The key lesson was to carefully consider the invariants in your programs.

...

"Fast forward to 2006. I was shocked to learn that the binary search program that Bentley proved correct and subsequently tested in Chapter 5 of Programming Pearls contains a bug."

As such, examining this particular problem serves as a microcosm in which to explore some aspects of programming in general.

Again, the referenced posts cover the code and techniques in considerable depth. What I want to talk about here is a small simplification of the code that Mike Taylor offers in his posts.

So let's look at some actual code.

Algorithm description

The idea is simple. You're looking for an item in a sorted list, so you look in the middle. If the thing you find is bigger than what you're looking for, then the item - if it's there at all - must be in the first half, otherwise it's in the second half.

Lather, rinse, repeat.

You'd think it would be simple, but the evidence of all the discussion, and the more recent article from 2006, suggests that it's harder than you think. I strongly advise you to try the challenge as laid out in the first blog post:

Spoilers follow ...

Example code ...

Mike starts with the invariant:

  • if i is the location of val in a,
    • then lower <= i <= upper

Here is the code that Mike derived:

(Note: I've taken the liberty of reformatting it and making non-functional changes. Refer to the original if you're concerned)

 
    int binary_search(
        const int a[],
        const int size,
        const int val
    ) {
        int lower = 0;
        int upper = size-1;

        while (lower <= upper) {
            int i = lower + (upper-lower)/2;
            if (val == a[i]) return i;
            if     (val < a[i])   upper = i-1;
            else /* val > a[i] */ lower = i+1;
        }
        return -1;
    }

I was stupid - I claimed:

  • There is a simpler invariant and simpler code that together have a few advantages:

    • 1. No overflow or underflow problems
    • 2. Finds the first element among those that compare equal
      • (Actually the version I'm presenting here finds the last among equals; Mike's early-exit version can return any of the equal elements - see the sketch after this list)
    • 3. Has only two cases to branch between and not three.

  • On the downside it still does log(size) comparisons even if it finds the desired key on the first comparison.
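
To see the point about equal keys concretely, here is a hypothetical snippet - mine, not from either post - run against Mike's binary_search above. With an array of equal keys, the early-exit version returns whichever index it happens to probe first:

    #include <stdio.h>

    int binary_search(const int a[], const int size, const int val);

    int main(void) {
        int a[] = {3, 3, 3, 3, 3};
        /* The first probe is at index 0 + (4-0)/2 = 2, and a[2]
           matches, so 2 is returned - neither the first 3 nor
           the last. */
        printf("%d\n", binary_search(a, 5, 3));
        return 0;
    }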

The first version of this code - written without reference to Mike's code - had n for the size of the array. That was because it gets changed, and so no longer represents the size of the array, but instead represents the size of the section still under consideration. I've changed it to size at Mike's suggestion to allow easier comparison between my routine and his.

My first version

So here is the invariant I had in mind - it's very similar to Mike's:

  • if i is the location of val in a,
    • then lower <= i < lower+size

Note that I've specified a half-open interval.

 
    int find(int a[], int size, int val) {
      int lower = 0;
      if (size<=0) return -1;
      while (1) {
        if (size==1)
            return (val==a[lower]) ? lower : -1;
        int m = size/2;
        if (val<a[lower+m]) size  = m;
        else                size -= m, lower += m;
      }
      return -1; /* Never used */
    }
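
Before working through the reasoning, a quick smoke test - my own, not from the post - showing the last-among-equals behaviour:

    #include <stdio.h>

    int find(int a[], int size, int val);   /* the routine above */

    int main(void) {
        int a[] = {1, 3, 3, 3, 7, 9};
        printf("%d\n", find(a, 6, 3));   /* 3: the last of the equal 3s */
        printf("%d\n", find(a, 6, 7));   /* 4 */
        printf("%d\n", find(a, 6, 5));   /* -1: absent */
        return 0;
    }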

When we enter the while loop we know size is at least 1. After the size==1 test, size is at least 2, so 1 <= m < size. Therefore it's valid to probe location lower+m, and since size strictly decreases on every iteration, the size==1 test eventually succeeds and the routine returns.

We can check the invariant like this:

  • Assume that the value is in the array.
  • Assume that the last occurrence is at index i.
  • Our invariant is that
    • lower <= i < lower+size

  • We test for our element at location a[lower+m].
    • If the value at index lower+m is too big,
      • then i<lower+m.
      • Thus we set size to be m and
      • our invariant is preserved.

    • Otherwise we know that i>=lower+m, so we
      • keep our upper bound lower+size fixed and
      • move our lower bound up by
      • adding m to lower and
      • subtracting m from size

In the first case we reduce the range of elements we're searching over simply by reducing the value of size. In the other case we move up the lower bound by some amount and reduce the size by the same amount.
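
The invariant can also be checked mechanically. Here is a sketch of a property test - my construction, not from the post - that compares find against a naive linear scan for the last occurrence, over many small random sorted arrays:

    #include <assert.h>
    #include <stdlib.h>

    int find(int a[], int size, int val);

    /* Reference: index of the last occurrence of val, or -1. */
    static int last_linear(int a[], int size, int val) {
        int i, found = -1;
        for (i = 0; i < size; i++)
            if (a[i] == val) found = i;
        return found;
    }

    static int cmp_int(const void *p, const void *q) {
        int x = *(const int *)p, y = *(const int *)q;
        return (x > y) - (x < y);
    }

    int main(void) {
        int trial, i;
        for (trial = 0; trial < 10000; trial++) {
            int a[16], size = rand() % 16;
            for (i = 0; i < size; i++) a[i] = rand() % 8;  /* many duplicates */
            qsort(a, size, sizeof a[0], cmp_int);
            for (i = -1; i < 9; i++)   /* include values that are absent */
                assert(find(a, size, i) == last_linear(a, size, i));
        }
        return 0;
    }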

Using explicit lower and upper bounds ...

This can all equivalently be written using explicit upper and lower bounds. Some people find this a retrograde step in simplicity and clarity, but it serves as a stepping stone.

 
    int find(int a[], int size, int val) {
        int lower = 0, upper=size;
        if (upper<=0) return -1;
        while (1) {
            if (upper-lower==1)
                return (val==a[lower]) ? lower : -1;
            int m = lower+(upper-lower)/2;
            if (val<a[m]) upper = m;
            else          lower = m;
        }
        return -1; /* Never used */
    }

Here size has been replaced by upper-lower everywhere. Note that m is now the probe index itself rather than an offset from lower: setting size to the offset becomes upper = m, and adding the offset to lower becomes lower = m.

Simplifying ...

We can now move the return statement outside the while loop:

 
    int find(int a[], int size, int val) {
        if (size<=0) return -1;
        int lower = 0, upper=size;
        while (upper-lower>1) {
            int m = lower+(upper-lower)/2;
            if (val<a[m]) upper = m;
            else          lower = m;
        }
        return (val==a[lower]) ? lower : -1;
    }

Brief side point - sometimes we need to make code messier in order to facilitate a re-write that ends up clearer than the original. This is perhaps somewhat akin to climbing down from a local maximum in hill-climbing so that we can reach a higher - possibly the highest - peak on the other side of the valley.

This version is even shorter and cleaner. The invariant says that i - if it exists at all - is in the half-open interval [lower,upper). That means it's either in [lower,m) or in [m,upper), and the code simply picks the correct one.

There are other things to check. The arithmetic can never overflow, and we never access the array outside of the bounds that the parameters guarantee exist. The value of size (or its equivalent, upper-lower) is strictly decreasing and positive, so the loop terminates when it gets to 1.

It looks pretty bullet-proof. (Famous last words)

Note that I don't need to worry about signedness of variables or possible overflow conditions, regardless of the language variant being used, because the structure of the code ensures that I never go outside the boundaries guaranteed by the values passed in.
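
As a sketch of that point, here is the same routine with an unsigned size_t length - my own variant, not from the post. Since m always satisfies 1 <= m < size, no subtraction ever goes below zero and no index leaves the valid range, so the unsignedness causes no trouble:

    #include <stddef.h>

    /* Returns the index as a long, or -1 if absent; assumes the
       index fits in a long. */
    long find_u(const int a[], size_t size, int val) {
        size_t lower = 0;
        if (size == 0) return -1;
        while (size > 1) {
            size_t m = size / 2;                /* 1 <= m < size */
            if (val < a[lower + m]) size = m;
            else { size -= m; lower += m; }
        }
        return (val == a[lower]) ? (long)lower : -1;
    }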

Taking stock, and summing up.

So let's think about the differences between this and the version in the blog post.

One difference is that this never terminates early. It always takes about log2(n) steps - around 20 probes for a million elements - and if there are equal keys, it always returns the last one. That's an interesting change in behaviour, and should be kept in mind. It makes the time behaviour consistent, but it means you don't get a quick result if you happen to be lucky with an early guess.

But looking at the code, we can see that we don't set either of the bounds (lower and lower+size) outside the potential permitted probe locations. Mike's code sets lower to i+1, and you might need to worry about whether that overflows.

All this is quite "fluffy" and perhaps deserves expansion, but I'm reluctant to get drawn into the flame-wars that have often erupted around this issue.

Actually, i+1 doesn't overflow here, but the correctness of the second routine under potential overflow conditions is easier to reason about. In particular, reasoning about why Mike's routine doesn't have a problem with overflow or underflow requires reasoning about the specific flavour of C being used. My feeling is that this somehow makes the reasoning more "brittle". I freely agree that sometimes knowing such things is crucial, but if it can be avoided, then issues that would otherwise be pushed out to the implementation are brought under one's own control.

But none of that is conclusive. Is it simpler?

Personally, I find the detailed reasoning from this invariant easier than in Mike's case, but perhaps that's just selection bias. After all, I wrote this code, and Mike wrote his, so perhaps the code reflects our respective ways of thinking.

What do you think?

