My latest posts can be found here:
Previous blog posts:
Additionally, some earlier writings:
The Birthday Paradox - 2012/11/06
A classic puzzle/paradox is to ask:
If you haven't seen this before then I urge you to have a guess now. Is it 180 people? More than that? Fewer than that? What do you think?
Most people who haven't seen this before get it very wrong, and then are surprised by the answer.
Yes, as soon as you have 23 people in a room the chances that two of them share a birthday are just over 50%. This is true, even if you allow 366 days in the year, and it actually becomes more true (whatever that means!) if you take into account that birthdays are not evenly distributed throughout the year.
Why is it so small?
Suppose we have one person in the room, and consider what happens as we add more people. The first extra person only has to avoid one birthday, and that's not so hard. But then the next person has to avoid two existing birthdays, and the next has to avoid three existing birthdays. Not only must everyone who has come before have succeeded in avoiding the coincidence, but it gets harder as we go along. This accumulation of avoided coincidences eventually gets you, and sooner than you may think.
Indeed, another way to view this is to look at how many pairings there are of the people in the room. With 23 people there are 253 possible pairs, more than half the number of days in the year, so suddenly it's not all that implausible that we have a shared birthday. (That's quite a lot more than half 365, but we may have three people sharing a birthday, or more, and the sums get quite complicated).
And normal people stop there. They might be surprised by the result, and some don't even believe it, but certainly most people stop.
Mathematicians aren't normal people ...
But mathematicians aren't normal people, and they might ask - what happens in general?
To do the analysis we turn this around and ask - what is the probability that we have no collisions? For just one person the answer is 1 - there is no chance of sharing a birthday, so there is a 100% chance of no shared birthday.
With two people, the second person must avoid the birthday of the first person. Assuming 365 days in the year this is then 364/365, being the number of days allowed, divided by the number of days from which we must choose.
Now add another person. To avoid a coincidence we must, in addition to the first coincidence dodged, now avoid two existing days. That has a chance of (365-2)/365. And so on, and these accumulate. So the chance of 10 people avoiding each other's birthdays is:
We can make that a bit neater by saying that the first person has 0 days to avoid, so that makes it:
and there are 10 terms in total. By extension we can see that for k people, the chances they all avoid each other's birthdays is:
If we also replace 365 by N to make this even more general, and we observe that:
(where the pling represents the factorial operation) then we can write this fairly succinctly as:
So now we can ask - for a given value of N, what value of k first makes this greater than 50%?
And c turns out to be about 1.17741...
Where did that come from?
Now, you probably don't recognise that number, certainly I didn't. But after a bit of futzing about I found it to be remarkably close to
Where does that come from? Let's find out.
We need to analyse this:
Factorials are quite nasty to deal with, but we can use Stirling's approximation which says that:
Substituting that into equation (1), expanding and simplifying results in this somewhat scary beast:
That's actually much simpler than we might otherwise expect, but it's still pretty tough to see where to go from here. However, the experienced eye will see something that looks familiar.
We see something like this:
We've seen this before. Or at least, I have, and anyone else who has done some significant calculus, or combinatorics, or pretty much any more advanced mathematics. We know that as x gets larger:
More than that, if k is constant then we have:
But k is the number of things we choose, and that doesn't stay constant as N gets bigger. Still, all that means is that we have to work a little harder.
The above rules come from recognising that there is a series for logarithms. In particular, the Taylor expansion for logs is:
That means, after simplification:
Now, with some trepidation, we can go back to our original:
Taking logs of both sides:
Substitute our expression for (watch out for the minus signs!) and we get:
Isn't that amazingly simple?
Analysis of the analysis
That's about 1.17741 times and that explains our initial observation.
It also explains clearly where the ln(4) comes from. It's actually -2.ln(0.5) and the 2 comes from the Taylor Expansion of log, and the 0.5 is the probability.
Our formula is also pretty accurate, even for only moderate values of N. For N=365, our original birthday problem, that's just a shade under 22.5, whereas the interpolated answer is 22.77. If we pretend that we have 1000 days in a year, the interpolated true answer is 37.5 people, and the formula gives 37.233.
So if we have a pool of N items and we select from them at random, with the possibility of repeats, by the time we have selected 1.2 times items we have a 50% chance of a collision. This is relevant when designing hash tables in computing, and using hashes to represent items.
Our formula gives us more, though, because we can substitute any given desired probability of a clash. Suppose you want a 90% chance of no collision. Then
For a million items, we have a 10% chance of no collisions with 2146 selections, we have a 50% chance of no collisions with 1178 selections, and if you want 90% chance of no collisions then you can only select 459 items.
That's quite small, which is why hash spaces have to be so huge to avoid collisions. It used to be quite common to use 40-bit CRC checks, but with only a million objects there's a 36% chance of a hash collision. Even using 64 bit hashes it only takes a billion objects to have a 2.6% chance of a collision, and 2 billion to have a 10% chance of a collision.
In summary, choosing from N items, with N large, and wanting a probability of T of having no collisions, how many items can you choose at random with replacement?
My thanks (in no particular order) to Wendy Grossman ( http://www.pelicancrossing.net ), Patrick Chkoreff, David Bedford, @ImC0rmac, @tombutton, @Risk_Carver, @standupmaths, @hornmaths, @snapey1979, and @pozorvlak for comments on early drafts.
Added in edit: Someone at last has found the deliberate error - congratulations to @jimmykiselak. I'll leave it in place for now for others to ponder over.
I've decided no longer to include comments directly via the Disqus (or any other) system. Instead, I'd be more than delighted to get emails from people who wish to make comments or engage in discussion. Comments will then be integrated into the page as and when they are appropriate.
If the number of emails/comments gets too large to handle then I might return to a semi-automated system. We'll see.
Links on this page