The Birthday Paradox

   

The Birthday Paradox - 2012/11/06

A classic puzzle/paradox is to ask how many people you need in a room before a shared birthday becomes more likely than not.

(You may feel like you've seen this before. If you have, skip to the end, where I talk about a formula and discuss its relevance to parameters in computer hashing algorithms. In particular, I'm going to talk about how large hash spaces should be to avoid collisions with a given level of probability.)

If there are just two people in a room, it's very unlikely that they will have the same birthday. On the other hand, if there are 1000 people in a room, it's absolutely certain that there will be shared birthdays, as there simply aren't enough days to go round without repeats. So as we add people to a room the chances of a shared birthday rise from 0 to 1, and at some point will pass through the halfway mark.

We assume that birthdays are distributed uniformly at random throughout the year. How many people do you need to have in a room before the chances of a shared birthday are more than 50%?

If you haven't seen this before then I urge you to have a guess now. Is it 180 people? More than that? Fewer than that? What do you think?

Most people who haven't seen this before get it very wrong, and then are surprised by the answer.

In case you are one of the many, many people who get this wrong and are astonished, you may be thinking of a different question. My birthday is a specific date. If we assume birthdays are uniformly distributed, how many people must you ask before you find someone who has the same birthday as me?

On average, about 182.

If you ask some 150 to 200 people, the odds are about 50% that you'll find someone who shares my specific birthday.

But that's not the question I asked. They don't just need to share my birthday, it can be that any birthday is shared among them, and that changes the odds dramatically.

Johnny Carson got it wrong.

The number of people?

Just 23.

Yes, as soon as you have 23 people in a room the chances that two of them share a birthday are just over 50%. This is true, even if you allow 366 days in the year, and it actually becomes more true (whatever that means!) if you take into account that birthdays are not evenly distributed throughout the year.
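The 23 can be checked directly. This is a minimal Python sketch, assuming uniformly distributed birthdays; the function name is invented here for illustration.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday (uniform birthdays assumed)."""
    no_shared = 1.0
    for already_present in range(people):
        # Each newcomer has to avoid every birthday already in the room.
        no_shared *= (days - already_present) / days
    return 1.0 - no_shared


for days in (365, 366):
    p22 = shared_birthday_probability(22, days)
    p23 = shared_birthday_probability(23, days)
    print(f"{days} days: 22 people -> {p22:.4f}, 23 people -> {p23:.4f}")
```

With either 365 or 366 days the probability is below one half for 22 people and above it for 23.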

Why is it so small?

Suppose we have one person in the room, and consider what happens as we add more people. The first extra person only has to avoid one birthday, and that's not so hard. But then the next person has to avoid two existing birthdays, and the next has to avoid three existing birthdays. Not only must everyone who has come before have succeeded in avoiding the coincidence, but it gets harder as we go along. This accumulation of avoided coincidences eventually gets you, and sooner than you may think.

Indeed, another way to view this is to look at how many pairings there are of the people in the room. With 23 people there are 253 possible pairs, more than half the number of days in the year, so suddenly it's not all that implausible that we have a shared birthday. (That's quite a lot more than half 365, but we may have three people sharing a birthday, or more, and the sums get quite complicated).
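The pair count is easy to confirm, and it also gives a rough estimate of the answer. The sketch below pretends the pairs are independent of each other, which they are not, so treat the estimate as a sanity check rather than the true probability.

```python
import math

# Number of ways to pick 2 people out of 23.
pairs = math.comb(23, 2)

# Rough estimate only: each pair misses with chance 364/365, and the
# pairs are treated as independent (an assumption that is not exactly true).
estimate = 1 - (364 / 365) ** pairs

print(pairs)               # 253
print(round(estimate, 3))
```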

And normal people stop there. They might be surprised by the result, and some don't even believe it, but certainly most people stop.

Mathematicians aren't normal people ...

But mathematicians aren't normal people, and they might ask - what happens in general?

This is relevant when we talk about avoiding hash collisions in computing systems, so it makes sense to talk about huge numbers of possible "birthdays" and huge numbers of "people." We can ask, when we have a huge hash space, how many keys can we choose before there's a 1% chance of a collision? (for some value of 1%).

In case you don't know about hashes and hash spaces, a "hash function" is a way of taking a sort of "fingerprint" of an object. The result is the hash of the object, and a hash is usually designed to be a random-looking collection of bits. Hashes used to be of size 32 or 64 bits, because that fitted nicely into computer storage units, but these days hashes tend to be much larger.

Hashes usually have the property that changing the object ever so slightly changes the hash pretty much apparently at random. Cryptographic hashes also have the property that it's really, really hard to deduce anything about the original object, even if you know exactly how the hash is computed.

So what happens as we deal with larger numbers, and with different probabilities? Let's see how we get to the answer of 23 people given 365 possible birthdays, and then generalise from there.

To do the analysis we turn this around and ask - what is the probability that we have no collisions? For just one person the answer is 1 - there is no chance of sharing a birthday, so there is a 100% chance of no shared birthday.

With two people, the second person must avoid the birthday of the first person. Assuming 365 days in the year this is then 364/365, being the number of days allowed, divided by the number of days from which we must choose.

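Now add another person. To avoid a coincidence we must, in addition to the first coincidence dodged, now avoid two existing days. That has a chance of (365-2)/365. And so on, and these accumulate. So the chance of 10 people avoiding each other's birthdays is:

$$\frac{364}{365}\cdot\frac{363}{365}\cdot\frac{362}{365}\cdots\frac{356}{365}$$

We can make that a bit neater by saying that the first person has 0 days to avoid, so that makes it:

$$\frac{365-0}{365}\cdot\frac{365-1}{365}\cdot\frac{365-2}{365}\cdots\frac{365-9}{365}$$

and there are 10 terms in total. By extension we can see that for k people, the chances they all avoid each other's birthdays is:

$$\frac{365-0}{365}\cdot\frac{365-1}{365}\cdot\frac{365-2}{365}\cdots\frac{365-(k-1)}{365}$$

If we also replace 365 by N to make this even more general, and we observe that:

$$N(N-1)(N-2)\cdots(N-k+1) = \frac{N!}{(N-k)!}$$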

Just as a reminder, we use n! to mean n(n-1)(n-2)(n-3)...3x2x1, so that means 6!=6x5x4x3x2x1=720. The exclamation mark is sometimes pronounced "pling" and this operation is called "factorial".

Note that elsewhere I use the period to mean "times", instead of using "x".

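(where the pling represents the factorial operation) then we can write the probability P of no shared birthday fairly succinctly as:

$$P = \frac{N!}{N^k\,(N-k)!} \qquad (1)$$

So now we can ask - for a given value of N, what value of k first makes the chance of a collision, which is one minus this, greater than 50%?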
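As a sketch of how this might be evaluated in practice, the probability of no collision can be computed either as a running product or from the factorial form N!/(N^k (N-k)!). The function names are made up for illustration, and the factorial version works in logarithms (via `math.lgamma`) only because the factorials themselves are far too large to hold as floats.

```python
import math


def no_collision_product(n: int, k: int) -> float:
    """Multiply the chances that each of k choices avoids the earlier ones."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p


def no_collision_factorial(n: int, k: int) -> float:
    """Evaluate N! / (N^k (N-k)!) in logs; lgamma(m + 1) equals ln(m!)."""
    if k > n:
        return 0.0  # more choices than slots, so a collision is certain
    log_p = math.lgamma(n + 1) - math.lgamma(n - k + 1) - k * math.log(n)
    return math.exp(log_p)


print(no_collision_product(365, 23))
print(no_collision_factorial(365, 23))
```

The two agree to many decimal places, and one minus either value is the chance of a shared birthday.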

What do we mean by the interpolated answer?

We never hit exactly 50%. So for some value of k we're below, and for k+1 we're above. What we can then do is interpolate between those values of k to get a non-integer that would be (almost exactly) the right answer. I'm using linear interpolation because we're working over a small range and it seems reasonable.

We can do this "exactly" by just computing the answer for lots of values of k and then interpolating between them to get an answer. This is a numerical technique, and if we do it lots and lots we might see a pattern emerging. It seems that the value k that gives us a probability of 50% of a collision is about $\sqrt{N}$. Looking more closely, we can see that there's a constant c so that if we choose $c\sqrt{N}$ numbers, then our probability of a clash is about 50%.

And c turns out to be about 1.17741...
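Here is one way the numerical experiment might be run. It is a sketch under the same uniform assumption, using linear interpolation between the last k below the target and the first k above it; the function name and the choice of N values are mine, not taken from the original calculation.

```python
import math


def interpolated_threshold(n: int, target: float = 0.5) -> float:
    """Non-integer count at which the collision chance reaches `target`."""
    no_collision = 1.0  # chance of no collision among k items
    below = 0.0         # collision chance for k items
    k = 1
    while True:
        # Item k + 1 has to avoid the k values already chosen.
        no_collision *= (n - k) / n
        above = 1.0 - no_collision  # collision chance for k + 1 items
        if above >= target:
            # Linear interpolation between k and k + 1.
            return k + (target - below) / (above - below)
        below = above
        k += 1


for n in (365, 1000, 10**4, 10**5, 10**6):
    k = interpolated_threshold(n)
    print(f"N = {n}: k = {k:.3f}, k / sqrt(N) = {k / math.sqrt(n):.5f}")
```

As N grows the ratio in the last column settles down towards the constant quoted above.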

Where did that come from?

Now, you probably don't recognise that number, certainly I didn't. But after a bit of futzing about I found it to be remarkably close to $\sqrt{\ln 4}$.

Where does that come from? Let's find out.

We need to analyse this:

$$P = \frac{N!}{N^k\,(N-k)!}$$

Factorials are quite nasty to deal with, but we can use Stirling's approximation which says that:

$$n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^{n}$$
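A quick numerical check shows how good Stirling's approximation is, taking its usual leading-order form as coded below. The helper name is invented for this sketch.

```python
import math


def stirling(n: int) -> float:
    """Leading-order Stirling approximation: sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


for n in (5, 10, 50):
    ratio = stirling(n) / math.factorial(n)
    print(f"n = {n}: approximation / exact = {ratio:.5f}")
```

The ratio is already within one percent of 1 by n = 10, and it gets closer as n grows.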

Substituting that into equation (1), expanding and simplifying results in this somewhat scary beast:

$$P \approx e^{-k}\left(1-\frac{k}{N}\right)^{-\left(N-k+\frac{1}{2}\right)}$$

That's actually much simpler than we might otherwise expect, but it's still pretty tough to see where to go from here. However, the experienced eye will see something that looks familiar.

We see something like this:

$$\left(1-\frac{k}{N}\right)^{N}$$

We've seen this before. Or at least, I have, and anyone else who has done some significant calculus, or combinatorics, or pretty much any more advanced mathematics. We know that as x gets larger:

$$\left(1+\frac{1}{x}\right)^{x} \to e$$

More than that, if k is constant then we have:

$$\left(1+\frac{k}{x}\right)^{x} \to e^{k}$$
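These limits can be watched converging. The sketch below uses the standard forms (1 + 1/x)^x and (1 + k/x)^x, with k = 3 picked arbitrarily for illustration.

```python
import math


def compound(k: float, x: float) -> float:
    """Evaluate (1 + k/x)^x, which tends to e^k as x grows with k held fixed."""
    return (1 + k / x) ** x


for x in (10, 1_000, 100_000, 10_000_000):
    print(f"x = {x}: k=1 gives {compound(1, x):.6f}, k=3 gives {compound(3, x):.6f}")

print(f"e = {math.e:.6f}, e^3 = {math.exp(3):.6f}")
```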

But k is the number of things we choose, and that doesn't stay constant as N gets bigger. Still, all that means is that we have to work a little harder.

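The above rules come from recognising that there is a series for logarithms. In particular, the Taylor expansion for logs is:

$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots$$

That means, after simplification:

$$\ln\left(1-\frac{k}{N}\right) = -\frac{k}{N} - \frac{k^2}{2N^2} - \frac{k^3}{3N^3} - \cdots$$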

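Now, with some trepidation, we can go back to our original:

$$P \approx e^{-k}\left(1-\frac{k}{N}\right)^{-\left(N-k+\frac{1}{2}\right)}$$

Taking logs of both sides:

$$\ln P \approx -k - \left(N-k+\tfrac{1}{2}\right)\ln\left(1-\frac{k}{N}\right)$$

Substitute our expression for $\ln\left(1-\frac{k}{N}\right)$ (watch out for the minus signs!) and we get:

$$\ln P \approx -k + \left(N-k+\tfrac{1}{2}\right)\left(\frac{k}{N} + \frac{k^2}{2N^2} + \frac{k^3}{3N^3} + \cdots\right)$$

We're going to assume that k is "small" compared with N. In fact, we're going to assume that k is about $\sqrt{N}$, so that $k^2/N$ stays around 1 while terms such as $k/N$ and $k^3/N^2$ shrink towards 0 as N grows.

We can prove that, or carry more terms, but it's not appropriate here.

This expands and simplifies to:

$$\ln P \approx -\frac{k^2}{2N}$$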

Isn't that amazingly simple?
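It's worth checking that a result this simple is actually close. The sketch below compares the exact log-probability of no collision, summed term by term, against the approximation ln(P) ≈ -k²/(2N), with k taken to be about the square root of N. Function names are invented for illustration.

```python
import math


def exact_log_no_collision(n: int, k: int) -> float:
    """ln of (1 - 0/N)(1 - 1/N)...(1 - (k-1)/N); needs k <= n."""
    return sum(math.log1p(-i / n) for i in range(k))


def approx_log_no_collision(n: int, k: int) -> float:
    """The large-N approximation -k^2 / (2N)."""
    return -k * k / (2 * n)


for n in (365, 10**4, 10**6):
    k = round(math.sqrt(n))
    exact = exact_log_no_collision(n, k)
    approx = approx_log_no_collision(n, k)
    print(f"N = {n}, k = {k}: exact {exact:.5f}, approx {approx:.5f}")
```

The gap between the two columns shrinks as N grows, which is what the neglected terms predict.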

Analysis of the analysis

I'm going to have a bit of a rant here.

In the above calculations there were at least three occasions where I recognised things because I was familiar with them, had used them, had played with them, and they were, in a sense, my "friends."

People often ask "Why did you make that approximation?" or "How did you know that would work?" The short answer is often "I didn't, but it felt right."

People often ask why they need to memorise formulas, or why they need to practice solving equations, when they can simply look stuff up whenever they need it, and on-line computer algebra systems can solve equations faster than they can, and more reliably.

But this is an example of why the ability simply to look stuff up is near useless on its own. Searches are deep and wide, and you need intuition to guide you. You need to recognise what might work, things you've seen before, directions to take that are more likely to be fruitful.

Or profitable.

The day probably will come when computers can do all of that better than we can, but that day isn't here yet. We still need human intuition, built from experience and practice, to guide the computer searches, to know what is more likely to work.

If you already know how to do this sort of calculation then you're probably nodding. If you don't, and you can't see how someone can possibly do this kind of stuff, this comment is for you. Practice and experience.

Play.

Once you play with things, the ability to invent and improvise is unleashed.

So if we're looking at a 50% chance of a coincidence, then P=0.5, and in that case ln(P)=-ln(2). Our formula then tells us that for large N, the number of people needed to have a 50% chance of duplicate birthdays is about $\sqrt{2N\ln 2}$, or $c\sqrt{N}$ where $c = \sqrt{2\ln 2} = \sqrt{\ln 4}$.

That's about 1.17741 times $\sqrt{N}$, and that explains our initial observation.

It also explains clearly where the ln(4) comes from. It's actually -2.ln(0.5) and the 2 comes from the Taylor Expansion of log, and the 0.5 is the probability.

Our formula is also pretty accurate, even for only moderate values of N. For N=365, our original birthday problem, that's just a shade under 22.5, whereas the interpolated answer is 22.77. If we pretend that we have 1000 days in a year, the interpolated true answer is 37.5 people, and the formula gives 37.233.
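The formula side of that comparison takes only a couple of lines. A minimal sketch, with a function name of my own choosing:

```python
import math


def fifty_percent_threshold(n: float) -> float:
    """Large-N estimate of the count giving a 50% collision chance: sqrt(N * ln 4)."""
    return math.sqrt(n * math.log(4))


for n in (365, 1000):
    print(f"N = {n}: formula gives {fifty_percent_threshold(n):.3f}")
```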

So if we have a pool of N items and we select from them at random, with the possibility of repeats, by the time we have selected about 1.2 times $\sqrt{N}$ items we have a 50% chance of a collision. This is relevant when designing hash tables in computing, and using hashes to represent items.

Our formula gives us more, though, because we can substitute any given desired probability of a clash. Suppose you want a 90% chance of no collision. Then P=0.9, so $k \approx \sqrt{-2N\ln 0.9} \approx 0.459\sqrt{N}$.

For a million items, we have a 10% chance of no collisions with 2146 selections, we have a 50% chance of no collisions with 1178 selections, and if you want 90% chance of no collisions then you can only select 459 items.
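Those three figures drop out of the approximation ln(P) ≈ -k²/(2N) once it is solved for k. The sketch below does that for a pool of a million items; the function and parameter names are illustrative only.

```python
import math


def selections_for_no_collision(n: float, t: float) -> float:
    """Approximate count k at which the chance of no collision falls to t (0 < t < 1)."""
    return math.sqrt(-2 * n * math.log(t))


for t in (0.1, 0.5, 0.9):
    k = selections_for_no_collision(10**6, t)
    print(f"{t:.0%} chance of no collision: about {k:.1f} selections")
```

The 50% case comes out a little over 1177, which is the 1178 quoted above once rounded up to a whole selection.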

That's quite small, which is why hash spaces have to be so huge to avoid collisions. It used to be quite common to use 40-bit CRC checks, but with only a million objects there's a 36% chance of a hash collision. Even using 64-bit hashes it only takes a billion objects to have a 2.6% chance of a collision, and 2 billion to have a 10% chance of a collision.
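The same approximation gives these hash figures directly. This sketch assumes hash values behave as if chosen uniformly at random from 2^bits possibilities, which is an idealisation; the function name is hypothetical.

```python
import math


def hash_collision_probability(hash_bits: int, objects: int) -> float:
    """Approximate chance of at least one collision among `objects` random hashes."""
    space = 2 ** hash_bits
    # ln(P of no collision) is about -k^2 / (2N); expm1 keeps precision
    # when the resulting probability is very small.
    return -math.expm1(-(objects ** 2) / (2 * space))


print(f"40-bit, 1 million objects: {hash_collision_probability(40, 10**6):.1%}")
print(f"64-bit, 1 billion objects: {hash_collision_probability(64, 10**9):.1%}")
print(f"64-bit, 2 billion objects: {hash_collision_probability(64, 2 * 10**9):.1%}")
```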

In summary, choosing from N items, with N large, and wanting a probability of T of having no collisions, how many items can you choose at random with replacement?

Answer:

$$k \approx \sqrt{-2N\ln T}$$


My thanks (in no particular order) to Wendy Grossman ( http://www.pelicancrossing.net ), Patrick Chkoreff, David Bedford, @ImC0rmac, @tombutton, @Risk_Carver, @standupmaths, @hornmaths, @snapey1979, and @pozorvlak for comments on early drafts.

Added in edit: Someone at last has found the deliberate error - congratulations to @jimmykiselak. I'll leave it in place for now for others to ponder over.

