
When optimising code, never guess, always measure.
This is a truism that lots of people quote, but it can be hard to
remember, especially in the heat of battle (as it were). Rather
fortunately it came to mind just when needed, as I found something
completely unexpected.
I was writing a simple implementation of the Fermat difference of
squares method of factoring. This involves writing the number to
be factored as, you guessed it, the difference of two squares.
If $n=a^2-b^2$ then $n=(a-b)(a+b)$ and we have a factorisation
(provided we don't have $a-b=1$).
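As a tiny example, $21 = 5^2-2^2 = (5-2)(5+2) = 3\times7$.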
Version 0
So the obvious thing to do is to take the square root of $n$, round
it up to get a candidate for $a$, then look at $a^2-n$. If that's
a square then we're done.

def factor_fermat0(n):
    a = square_root_ceil(n)
    while True:
        d = a*a - n
        b = square_root_ceil(d)
        if b**2 == d:
            return a-b, a+b
        a += 1
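
This (and every later version) relies on a square_root_ceil helper that the
post doesn't show. A minimal sketch of what it might look like (my assumption,
using math.isqrt, which needs Python 3.8 or later):

import math

def square_root_ceil(n):
    # smallest integer r with r*r >= n
    r = math.isqrt(n)
    return r if r*r == n else r + 1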

Version 1
We can improve on this. We can increase $a$ until we have $a^2-b^2>n$,
and then increase $b$ until $a^2-b^2<n$. If $n$ is odd then we'll get
equality at some point, so we just continue until that happens, and
we're done.

def factor_fermat1( n ):
    a = square_root_ceil(n)
    b = square_root_ceil(a**2 - n)
    while a**2 - b**2 != n:
        while a**2 - b**2 < n: a += 1
        while a**2 - b**2 > n: b += 1
    return a-b, a+b
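
As a quick sanity check (assuming the square_root_ceil sketch above), calling
this on the first of the test numbers used below should give back its two
factors:

print(factor_fermat1(4736906887651))   # expected output: (1211141, 3911111)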

Version 2
Now we notice that we are repeatedly computing the squares of $a$
and $b$, so we create variables to hold the squares and update them
by remembering that $(x+1)^2=x^2+2x+1$.

def factor_fermat2( n ):
    a = square_root_ceil(n)
    a2 = a**2
    b = square_root_ceil(a**2 - n)
    b2 = b**2
    while a2 - b2 != n:
        while a2 - b2 < n: a2, a = a2+a+a+1, a+1
        while a2 - b2 > n: b2, b = b2+b+b+1, b+1
    return a-b, a+b

Version 3
We can improve that slightly further by having $c=2a+1$ and $d=2b+1$,
and the code now becomes:

def factor_fermat3( n ):
    a = square_root_ceil(n)
    a2 = a**2
    c = 2*a+1
    b = square_root_ceil(a**2 - n)
    b2 = b**2
    d = 2*b+1
    while a2 - b2 != n:
        while a2 - b2 < n: a2, c = a2+c, c+2
        while a2 - b2 > n: b2, d = b2+d, d+2
    return (c-d)/2, (c+d)/2 - 1
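
To see where that return statement comes from: the loop maintains $c=2a+1$ and
$d=2b+1$, because $a^2+c=(a+1)^2$ and the new value of $c$ is $2(a+1)+1=c+2$
(and similarly for $b$ and $d$), and at the end $(c-d)/2=a-b$ and
$(c+d)/2-1=a+b$, the same values the earlier versions return.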

Let's see how we're doing
These improvements all seem obvious and straightforward, and we can
check by doing some timing tests. The times, and indeed the ratios
of times, will vary according to the number being factored, so we'll
try two different numbers and see what the trend is. If you choose to
analyse the algorithm then you'll see what's going on. Here are the
results for the two numbers we factored:
4736906887651 = 1211141 x 3911111:

Routine      0        1        2        3
Run 0      9.223    0.409    0.258    0.188
Run 1      9.138    0.408    0.258    0.188
Run 2      9.146    0.407    0.262    0.194
Run 3      9.149    0.416    0.258    0.188
Run 4      9.228    0.408    0.272    0.189
Run 5      9.187    0.407    0.258    0.188
Run 6      9.154    0.421    0.259    0.188
Run 7      9.157    0.407    0.259    0.189

47368008965429 = 3911111 x 12111139:

Routine      0        1        2        3
Run 0     34.922    1.230    0.774    0.564
Run 1     34.276    1.214    0.775    0.562
Run 2     34.191    1.215    0.777    0.564
Run 3     34.198    1.214    0.776    0.564
Run 4     34.182    1.230    0.776    0.563
Run 5     34.226    1.244    0.775    0.563
Run 6     34.144    1.235    0.773    0.564
Run 7     34.104    1.233    0.775    0.565

We can see that the first routine is rubbish, but that's no surprise,
as it's computing a square root every time through the loop. For the
others we can see a clear improvement as we make those optimisations.
It did surprise me just how big the improvement was from #2 to #3,
but it was nice to see.
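
The post doesn't show the timing harness itself; a rough sketch of one that
would produce tables like those above (the eight runs and a single call per
run are my assumptions) could be:

import time

n = 4736906887651   # = 1211141 x 3911111, the first test number

routines = [factor_fermat0, factor_fermat1, factor_fermat2, factor_fermat3]

for run in range(8):
    times = []
    for routine in routines:
        start = time.perf_counter()
        routine(n)
        times.append(time.perf_counter() - start)
    print("Run", run, "  ".join("%.3f" % t for t in times))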
Version 4
Then I made a small change to the code layout ready for another shot
at an optimisation. Pretty uncontroversial, but I ran the timing tests
again to make sure there was no difference, this time using a larger
number and not bothering with the reference "version 0" since that was
so slow.

def factor_fermat4( n ):
    a = square_root_ceil(n)
    a2 = a**2
    c = 2*a+1
    b = square_root_ceil(a**2 - n)
    b2 = b**2
    d = 2*b+1
    while a2 - b2 != n:
        while a2 - b2 < n:
            a2 += c
            c += 2
        while a2 - b2 > n:
            b2 += d
            d += 2
    return (c-d)/2, (c+d)/2 - 1


47368008965429 = 3911111 x 12111139:

Routine      1        2        3        4
Run 0      1.225    0.774    0.562    0.622
Run 1      1.223    0.772    0.562    0.624
Run 2      1.238    0.776    0.567    0.625
Run 3      1.233    0.777    0.566    0.626
Run 4      1.229    0.769    0.563    0.624
Run 5      1.231    0.773    0.562    0.624
Run 6      1.225    0.771    0.563    0.625
Run 7      1.219    0.771    0.563    0.622

Wait, what? It's slower? How can that be? I'd've thought it would
compile to exactly the same bytecode, and so execute in pretty much the
same time.
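
One way to check that assumption (not something from the original post, just
how I'd go about verifying it) is to disassemble the two forms with the dis
module and compare the bytecode:

import dis

def tuple_version(a2, c):
    a2, c = a2+c, c+2      # the form that turned out faster
    return a2, c

def separate_version(a2, c):
    a2 += c                # the form that turned out slower
    c += 2
    return a2, c

dis.dis(tuple_version)
dis.dis(separate_version)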
Just to be sure I ran it on a bigger number.
4736791755876611 = 39111113 x 121111147:

Routine      1        2        3        4
Run 0     12.388    7.775    5.672    6.245
Run 1     12.215    7.767    5.660    6.256
Run 2     12.295    7.910    5.674    6.301
Run 3     12.196    7.749    5.645    6.247
Run 4     12.221    7.731    5.646    6.294
Run 5     12.227    7.750    5.638    6.232
Run 6     12.226    7.798    5.691    6.250
Run 7     12.164    7.753    5.646    6.264

There is no doubt:

    a2, c = a2+c, c+2

is faster than

    a2 += c
    c += 2
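
To isolate just those two forms from the rest of the factoring loop, a small
timeit comparison would do; this is a sketch of how one might confirm it, not
part of the measurements above:

import timeit

setup = "a2, c = 1, 3"

print(timeit.timeit("a2, c = a2+c, c+2", setup=setup, number=10_000_000))
print(timeit.timeit("a2 += c\nc += 2",   setup=setup, number=10_000_000))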

That caught me by surprise, but it really does go to show:

When optimising, measure, measure, measure.

My guess is that in the first case the two evaluations and assignments
can happen in parallel, and so may happen on different cores, whereas
in the second case they must happen in serial, but that's just a guess.
Maybe someone reading this knows for sure.
Let me know.
I've got more changes to make, but one thing is for sure: I'll be
measuring the improvements as I go to make sure they really do make
things better.
Comments
The following comment was sent, but I don't know if the person is happy
to be mentioned by name; if it's you, please let me know. However, they
said this (paraphrased):
Using the standard Python interpreter the GIL would prevent

    a2, c = a2+c, c+2

from executing in parallel.

Another big optimization would be to replace a**2 with a*a. Multiplication
is much faster than the power operator for small exponents (< 4). It is
possible the interpreter does this automatically, but I don't think so.

And to whoever it was: thank you for the response.
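
For what it's worth, the a**2 versus a*a suggestion is easy to check in the
same spirit; a hypothetical snippet (not from the post) would be:

import timeit

print(timeit.timeit("a**2", setup="a = 123456789", number=10_000_000))
print(timeit.timeit("a*a",  setup="a = 123456789", number=10_000_000))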