Grep Timing Anomaly

   
Recent changes
Table of contents
Links to this page
FRONT PAGE / INDEX

Subscribe!
@ColinTheMathmo

My latest posts can be found here:
Previous blog posts:
Additionally, some earlier writings:

2017/07/28 - Timing Anomaly in "grep"

Some considerable time ago - long enough that probably the details are no longer relevant, or accurate - I got seriously fed up with the spam filtering on my email service. There were so many false positives that I had to wade through the spam bin every day, pulling out obviously good emails that needed attention. It got so that it cost me more time to have it running.

So I disabled it, and implemented my own based on Paul Graham's article: A Plan For Spam

No doubt there are better schemes commercially available now, but it was phenomenally successful, and still runs perfectly. I have about two false positives a month, and it saves me wading through the scum and dross that I get, having been visible on the interwebs now for decades.

There are many tweaks and enhancements, and I run a "White List" as well. People with whom I've interactions get automatically put on the white list as a matter of course, and that helps to seed the data nicely.

But recently I noticed something interesting - the whitelist processing was taking a huge amount of time. Stupidly large. So I ran some timing tests. These are quick tests done on real data.

First, let's just run the whitelist check:

 
$ time grep -irlf whitelist good_confirmed | wc

     47      47    1645

real	2m23.705s
user	2m16.488s
sys	0m0.240s

That's taking over 2 minutes to find the whitelisted messages. Well, OK, maybe it has a lot to do. But then I split the whitelist file into 10 separate files, each with a tenth of the entries:

 
$ time ( for f in white_* ; do grep -irlf $f good_confirmed ; done ) | wc

     55      55    1925

real	0m7.054s
user	0m6.576s
sys	0m0.072s

The different number of hits is because some files contain hits in more than one whitelist, so that's to be expected.
That's a huge drop in processing time, from over two minutes to under 8 seconds. There's go to be something bizarre about this. Just to run a quick check:

 
time grep -irlf <( sort white_* ) good_confirmed | wc

     47      47    1645

real	2m19.213s
user	2m11.036s
sys	0m0.324s

OK, using exactly the same files it takes 15 times longer to process them as a single file than it does to run 10 separate times. That's got to be a problem somewhere.

One day I'd love to chase this down, but for now I'll leave this here so I can come back to it when I get time. If you have any thoughts or comments, I'd love to hear them.


Adam Atkinson has been searching about to see if this is known and has found these:

Also, a commenter found this:

Now to investigate properly ...


<<<< Prev <<<<
The Radius Of The Earth
:
>>>> Next >>>>
Radius Of The Earth Part Two ...


https://mathstodon.xyz/@ColinTheMathmo You can follow me on Mathstodon.



Of course, you can also
follow me on twitter:

@ColinTheMathmo


Send us a comment ...

You can send us a message here. It doesn't get published, it just sends us an email, and is an easy way to ask any questions, or make any comments, without having to send a separate email. So just fill in the boxes and then

Your name :
Email :
Message :


Contents

 

Links on this page

 
Site hosted by Colin and Rachel Wright:
  • Maths, Design, Juggling, Computing,
  • Embroidery, Proof-reading,
  • and other clever stuff.

Suggest a change ( <-- What does this mean?) / Send me email
Front Page / All pages by date / Site overview / Top of page

Universally Browser Friendly     Quotation from
Tim Berners-Lee
    Valid HTML 3.2!