My latest posts can be found here:
Previous blog posts:
Additionally, some earlier writings:
2017/07/28 - Timing Anomaly in "grep"
Some considerable time ago - long enough that probably the details are no longer relevant, or accurate - I got seriously fed up with the spam filtering on my email service. There were so many false positives that I had to wade through the spam bin every day, pulling out obviously good emails that needed attention. It got so that it cost me more time to have it running.
So I disabled it, and implemented my own based on Paul Graham's article: A Plan For Spam
No doubt there are better schemes commercially available now, but it was phenomenally successful, and still runs perfectly. I have about two false positives a month, and it saves me wading through the scum and dross that I get, having been visible on the interwebs now for decades.
There are many tweaks and enhancements, and I run a "White List" as well. People with whom I've interactions get automatically put on the white list as a matter of course, and that helps to seed the data nicely.
But recently I noticed something interesting - the whitelist processing was taking a huge amount of time. Stupidly large. So I ran some timing tests. These are quick tests done on real data.
First, let's just run the whitelist check:
$ time grep -irlf whitelist good_confirmed | wc 47 47 1645 real 2m23.705s user 2m16.488s sys 0m0.240s
That's taking over 2 minutes to find the whitelisted messages. Well, OK, maybe it has a lot to do. But then I split the whitelist file into 10 separate files, each with a tenth of the entries:
$ time ( for f in white_* ; do grep -irlf $f good_confirmed ; done ) | wc 55 55 1925 real 0m7.054s user 0m6.576s sys 0m0.072s
time grep -irlf <( sort white_* ) good_confirmed | wc 47 47 1645 real 2m19.213s user 2m11.036s sys 0m0.324s
OK, using exactly the same files it takes 15 times longer to process them as a single file than it does to run 10 separate times. That's got to be a problem somewhere.
One day I'd love to chase this down, but for now I'll leave this here so I can come back to it when I get time. If you have any thoughts or comments, I'd love to hear them.
Adam Atkinson has been searching about to see if this is known and has found these:
Now to investigate properly ...
Send us a comment ...
Links on this page