Summing up - take a break.

Recently I was reading about approximating images with collections of overlapping translucent polygons (evolution of the Mona Lisa) or triangles (Mostly maths approximates images) and I thought it would be interesting to write a combination shotgun hill-climber and genetic algorithm. Nothing complex, nothing too intricate, but just an experiment to see how closely I could approximate an image with a very small number of triangles.

Results have been interesting and I'll write about them later. But on the way I had a bizarre experience.

I was in the middle of writing the code when I realised that I was on an elderly machine (one I could simply leave running for days) and the version of Python I was using didn't have the built-in "sum". So I quickly wrote one to add to my utilities collection:

    def sum(vector):
        if 0==len(vector): return 0
        return vector[0]+sum(vector[1:])

This is the obvious, classic, recursive functional solution, and it tripped off my fingers without thinking. I was, at the time, programming mostly machines with 256 processors or more, so parallel algorithms were always the first thing I looked at.

I should perhaps have used an accumulation parameter:

    def sum(vector,acc):
        if 0==len(vector): return acc
        return sum(vector[1:],acc+vector[0])

Proper tail-call optimisation (TCO) would turn that into an iterative version and it would run well.
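
To see concretely what TCO would buy, here's the loop it would effectively produce from that definition - a hand-written sketch, because Python won't do this transformation for you:

    def sum(vector, acc=0):
        # The tail call becomes a loop that rebinds the parameters;
        # the stack stays flat no matter how long the vector is.
        while len(vector) != 0:
            vector, acc = vector[1:], acc + vector[0]
        return acc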

But of course, Python doesn't do TCO - no version of it does - so the first time I called it in anger it blew the stack.

Bother.

Easy to fix, though, using the classic "divide and conquer" technique:

    def sum(vector):
        size = len(vector)
        if 0==size: return 0
        if 1==size: return vector[0]
        half = size/2
        return sum(vector[:half]) + sum(vector[half:])

This can't use TCO, but I don't have that anyway, so it doesn't matter. The recursion depth is now logarithmic in the length of the vector rather than linear, so the stack stays well within bounds, and it should run reasonably smartly.

And it does. It runs fairly well - here are some results:

    Size       Time (s)   µs/element
    508        0.00       3.82
    1524       0.01       3.88
    4572       0.02       3.98
    13717      0.07       4.82
    41152      0.18       4.47
    123456     0.58       4.67
    370370     1.75       4.73
    1111111    5.57       5.01
    3333333    16.96      5.09
    9999999    51.50      5.15

The first column is the size of the vector, the second the raw wall-clock time in seconds, and the third the time per element in microseconds. The time per element creeps up slowly, so the routine isn't strictly linear. That's not surprising: on top of the call overhead and the computation of the vector size (which could be optimised away), each level of recursion copies the vector via slicing, which adds an n log n component to the basic n additions. Even so, it's pretty good.

Moving on ...

But then I had a thought ...

Rather than recursing by cutting the vector in half, summing each half, and adding the results, why not pair off the elements, add the pairs to get a vector half the length, and then sum that? Like this ...

    def sum(vector):
        size = len(vector)
        if size>1:
            m = size/2
            m2 = m+m
            z = zip(vector[:m],vector[m:m2])
            h = [ a+b for (a,b) in z ]
            return sum(h)+sum(vector[m2:])
        if size==1: return vector[0]
        return 0

If the vector is of even length then we simply zip together the first and second halves, add the pairs, then recurse. If the vector is of odd length then we have to add on the extra element. I don't bother with the check, simply calling "sum" again on the vector from m2 onwards, which might be empty.
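
To make the shape of the recursion concrete, here's a hand-worked trace for a small odd-length vector, just following the code above:

    # sum([1,2,3,4,5]):
    #   m = 2, m2 = 4
    #   zip([1,2], [3,4])  ->  [(1,3), (2,4)]
    #   h = [4, 6]
    #   returns sum([4,6]) + sum([5])
    #
    # sum([4,6]):
    #   h = [10]
    #   returns sum([10]) + sum([]) = 10 + 0 = 10
    #
    # so the whole thing is 10 + 5 = 15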

Timing this one was interesting:

    Size       Time (s)   µs/element
    508        0.00       1.27
    1524       0.00       1.07
    4572       0.01       1.31
    13717      0.02       1.62
    41152      0.07       1.67
    123456     0.24       1.97
    370370     0.92       2.47
    1111111    4.70       4.23
    3333333    29.10      8.73
    9999999    231.91     23.19

Things go OK for a while - not linear, but not too bad - and then the times suddenly blow up enormously. I wondered if the system was hanging on to the intermediate values z and h and the memory use was blowing up, so I quickly folded those in:

    def sum(l):
        size = len(l)
        if 1<size:
            m = size/2
            return sum( [ a+b for (a,b) in zip(l[:m],l[m:m+m]) ] )+sum(l[m+m:])
        if 1==size: return l[0]
        return 0

It was better, but not a huge amount. I won't bother exploring that in detail.

Iteration instead of recursion

Even without TCO, we can still convert the pair-folding version into an iterative one by hand:

    def sum(vector):
        r = 0
        size = len(vector)
        while size>0:
            m = size/2
            m2 = m+m
            # odd length: pick up the unpaired final element now
            if m2<size: r += vector[-1]
            # fold the two halves into a vector half the length
            h = [ a+b for (a,b) in zip(vector[:m],vector[m:m2]) ]
            vector,size = h,m
        return r

I have to say I quite liked this - it turned out rather better than I had expected. Sadly, it runs at almost exactly the same speed as the recursive version.

Interesting.

Moving into built-ins ...

Of course, so far this is all running in Python, but there are built-ins that are written in C and run rather faster. The obvious one to use is the one from functional programming, the routine "reduce":

    def sum(l):
        return reduce(lambda a,b:a+b,l)
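
(As an aside - I haven't timed this variant - the lambda could be swapped for operator.add, keeping the addition itself in C as well:)

    import operator

    def sum(l):
        # reduce is a built-in in this elderly Python; operator.add
        # avoids calling back into a Python-level lambda per element.
        return reduce(operator.add, l)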

Now the reduce version is much faster. Here are the results, compared with the previous one:

                Folding               Reduce
    Size        Time (s)   µs/elt     Time (s)   µs/elt
    508         0.00       1.27       0.0004     0.78
    1524        0.00       1.07       0.0012     0.77
    4572        0.01       1.31       0.0036     0.78
    13717       0.02       1.62       0.0113     0.82
    41152       0.07       1.67       0.0330     0.80
    123456      0.24       1.97       0.1096     0.89
    370370      0.92       2.47       0.3045     0.82
    1111111     4.70       4.23       0.9295     0.84
    3333333     29.10      8.73       2.9419     0.88
    9999999     231.91     23.19      8.3294     0.83

We can see that the time per element for reduce stays pretty much constant as the list grows. That's what you'd expect of a genuinely linear-time routine: the setup cost gets amortised across more and more elements.

So now we've got a linear-time routine using a built-in - reduce - so I guess we're pretty much done.

After the break.

Some of you will have seen this coming.

After coding this up and leaving the image approximation running over lunch, I went for a walk, made a coffee, came back, and watched the images improving oh ... so ... very ... slowly.

And then I thought, how about this:

    def sum(vector):
        rtn = 0
        for element in vector:
            rtn += element
        return rtn

Yup. That beat the snot out of the other routine, took its lunch money, and swaggered off laughing:

    Size       Folding    Reduce     Direct    (all times in µs per element)
    508        1.27       0.83       0.50
    1524       1.07       0.88       0.51
    4572       1.31       0.84       0.52
    13717      1.62       0.82       0.51
    41152      1.67       0.89       0.52
    123456     1.97       0.80       0.48
    370370     2.47       0.82       0.46
    1111111    4.23       0.78       0.47
    3333333    8.73       0.77       0.45
    9999999    23.19      0.78       0.49

Post-script

The timings are wall-clock times, averaged over somewhere between 4 and 10 calls. I eye-balled the values and decided there weren't enough anomalies to bother about. Yes, occasionally there was a larger value, but broadly the figures given are all reasonable.
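
If you want to reproduce them, a harness along these lines is all it takes - this is a sketch of the idea, not the exact code I ran:

    import time

    def time_sum(sum_fn, vector, repeats=4):
        # Average wall-clock time over a few calls, then report
        # seconds per call and microseconds per element.
        start = time.time()
        for _ in range(repeats):
            sum_fn(vector)
        per_call = (time.time() - start) / repeats
        per_element = 1000000.0 * per_call / len(vector)
        return per_call, per_element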

The vector sizes were chosen to grow by a factor of about 3, which gives a good mix of odd and even lengths - most of the routines branch on whether the size is odd or even, and I wanted to exercise both. It seemed a good compromise without much work; this isn't a deep theoretical paper, after all.
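
(The sizes themselves turn out to be successive integer divisions by three, starting from 9999999, so they're easy to regenerate:)

    # Regenerates the ladder of test sizes used above:
    # 508, 1524, 4572, ..., 3333333, 9999999
    sizes = []
    n = 9999999
    while n > 500:
        sizes.append(n)
        n = n // 3      # integer division by three
    sizes.reverse()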

The routines aren't intended to be lovely hand-optimised code. This was just an observation that I thought I'd make. There are possibly hundreds of things you could try, small variations that may have surprisingly large effects.

I'd like to see them - so if you think these tests are fatally flawed, please run your own and tell us about them.