Grep Timing Anomaly

Recent changes

Table of contents

Links to this page

FRONT PAGE / INDEX

Subscribe!

My latest posts can be found here:

Colins Blog

Previous blog posts:

Additionally, some earlier writings:

2017/07/28 - Timing Anomaly in "grep"

Some considerable time ago - long enough that probably the details are no longer relevant, or accurate - I got seriously fed up with the spam filtering on my email service. There were so many false positives that I had to wade through the spam bin every day, pulling out obviously good emails that needed attention. It got so that it cost me more time to have it running.

So I disabled it, and implemented my own based on Paul Graham's article: A Plan For Spam

No doubt there are better schemes commercially available now, but it was phenomenally successful, and still runs perfectly. I have about two false positives a month, and it saves me wading through the scum and dross that I get, having been visible on the interwebs now for decades.

There are many tweaks and enhancements, and I run a "White List" as well. People with whom I've interactions get automatically put on the white list as a matter of course, and that helps to seed the data nicely.

But recently I noticed something interesting - the whitelist processing was taking a huge amount of time. Stupidly large. So I ran some timing tests. These are quick tests done on real data.

First, let's just run the whitelist check:

 
$ time grep -irlf whitelist good_confirmed | wc

     47      47    1645

real	2m23.705s
user	2m16.488s
sys	0m0.240s

That's taking over 2 minutes to find the whitelisted messages. Well, OK, maybe it has a lot to do. But then I split the whitelist file into 10 separate files, each with a tenth of the entries:

 
$ time ( for f in white_* ; do grep -irlf $f good_confirmed ; done ) | wc

     55      55    1925

real	0m7.054s
user	0m6.576s
sys	0m0.072s

The different number of hits is because some files contain hits in more than one whitelist, so that's to be expected.
That's a huge drop in processing time, from over two minutes to under 8 seconds. There's go to be something bizarre about this. Just to run a quick check:

 
time grep -irlf <( sort white_* ) good_confirmed | wc

     47      47    1645

real	2m19.213s
user	2m11.036s
sys	0m0.324s

OK, using exactly the same files it takes 15 times longer to process them as a single file than it does to run 10 separate times. That's got to be a problem somewhere.

One day I'd love to chase this down, but for now I'll leave this here so I can come back to it when I get time. If you have any thoughts or comments, I'd love to hear them.

Adam Atkinson has been searching about to see if this is known and has found these:

Also, a commenter found this: