Software Checklist



During the Second World War, fighter pilots would scramble to take off. As they bumped down the grass runway, engines at full throttle, trying to take off on a short, bumpy track with a full load of fuel and ammunition, their hearts would stop when the engine misfired. Was the fuel mix too rich, or too lean? They'd look at the control for the mixture and wonder which way to turn it. The right way would increase engine power and make lift-off straightforward. The wrong way would lose power, and there was rarely enough time to fix the mistake.

Airplanes got broken, and pilots died.

Until someone worked out that the thing to do was deliberately set the mixture too lean. Then when the engine misfired you knew which way to turn the control. The news spread fast, fewer 'planes got broken, and pilots stopped dying on the runway before they'd even got into the air.

Or so I've been told. I'd love to check this story, but WWII pilots are a bit thin on the ground now. If you know better, please let me know. It's considered plausible, though, by my uncle, who, as a hobby, is an aerobatics pilot. I'm not a pilot, and it's always interesting to hear him talk about it.

Then one day he said something fascinating:

"The main reason air accidents are so rare is the ubiquitous and relentless use of checklists."

He went on to explain that every time there was an accident, no matter what the cause, the final action was to review the checklists to see if that cause could be prevented. Sometimes it wasn't possible, but sometimes it was. Just like the pilots deliberately setting their mixtures too lean, the knowledge of how to prevent problems was systematically encoded into the procedures they follow.

Knowledge preserved, ready for use.

I remember as a kid watching science fiction films, or documentaries, and seeing the test pilots or astronauts going through endless, endless, trivial operations. "Set this switch to on, set that switch to override, set this other thing to off." On and on they would go, until I'd be screaming at the screen, "Just Go!" But watching the film "Apollo 13" it hit me again, and talking with Ken Mattingly (Apollo 16, STS-4, STS-51C) really brought it home. Many of their actions in that emergency were in the book, because they had spent countless hours in the simulators, against adversarial programmers, and had devised procedures to deal with failures like these. Even the idea of using the Lunar Module as a lifeboat had been discussed and simulated, although in the film it looks like they made it up on the spot.

The bit where they did the burn without their navigation computer, just using the view of the Earth and the terminator? They'd practised that before, and it was a documented procedure. It was in the book.

Knowledge preserved, ready for use.

And it's not just in rockets, where a mistake will literally kill you. It's also in more mundane places. I recently spoke at an event in a theatre. There, backstage, was a large trestle table, and on that table was a white cloth, and drawn on the cloth was the outline of every prop they would need in the show. The absence of a prop would be instantly obvious to everyone - there would be a gap. I'm in the process of adopting this for my travels - a tea-towel with a "shadow board" for the things I'm carrying. I can see at a glance if I'm missing something.

Knowledge preserved, ready for use.

I find this astonishing, but it is only recently that surgeons have been required to use explicit checklists in their operations, especially when closing up. For years they resisted this, but the evidence is clear - doing this saves lives and reduces post-operative complications.

"We don't forget things, we don't need this," they say. But they do forget, and they do need it. And it saves lives. Why would they resist? I honestly don't know. It's annoying, it's tedious, it's an inconvenience, but it helps to avoid mistakes.

It saves lives.

Wouldn't you do it?

Maybe not.

There was a huge furore about the Heartbleed Bug. One of the most fundamental pieces of software used in communicating between computers had an exploitable flaw. There is an XKCD that explains it beautifully for non-technical people, and, quite frankly, for technical people too. The rush is on to fix the bug (done), deploy the fix (mostly done), and recover the situation (we'll see).

So now what?

Well, various groups are planning large-scale, in-depth audits of existing code to try to unearth any similar problems.

That's important.

But how are we saving the knowledge? How are we encoding the lessons learned? How are we making the experience available to those who come after us?

How are we embedding this new knowledge into the fabric of what we do?

What is the equivalent of our trestle table with the shadow-board?

What is our equivalent of the checklist?

Here's one idea. The Heartbleed bug (as I understand it) would have been exposed by the following:

  • Replace calls to malloc in the server with calls to a memory allocator that fills memory with a known, unusual pattern of bytes (for example 0xDEADBEEF or similar)

  • Run a fuzzer at the client end. Don't use calls that obey the rules - send carefully crafted semi-randomised queries.

  • Look in the replies for the known pattern.
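
The three steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the real server or protocol: `PoisonedArena` stands in for the pattern-filling allocator, `buggy_echo` and `fixed_echo` are hypothetical handlers (one trusts the length the client claims, the other doesn't), and `fuzz` is a deliberately crude fuzzer. All of these names are invented for illustration.

```python
import random

POISON = bytes.fromhex("deadbeef")  # the known, unusual byte pattern


class PoisonedArena:
    """Toy stand-in for a malloc replacement: all memory starts as POISON."""

    def __init__(self, size):
        self.memory = bytearray(POISON * (size // len(POISON)))
        self.next_free = 0

    def alloc(self, data):
        """Copy data into the arena and return the offset where it was stored."""
        offset = self.next_free
        self.memory[offset:offset + len(data)] = data
        self.next_free += len(data)
        return offset


def buggy_echo(arena, payload, claimed_len):
    """Hypothetical handler with a Heartbleed-style flaw: it trusts claimed_len."""
    offset = arena.alloc(payload)
    return bytes(arena.memory[offset:offset + claimed_len])


def fixed_echo(arena, payload, claimed_len):
    """The same handler, but it never reads past the payload it was given."""
    offset = arena.alloc(payload)
    safe_len = min(claimed_len, len(payload))
    return bytes(arena.memory[offset:offset + safe_len])


def fuzz(handler, rounds=200, seed=0):
    """Send semi-random requests that break the rules; collect those that leak."""
    rng = random.Random(seed)
    leaks = []
    for _ in range(rounds):
        arena = PoisonedArena(4096)
        # Printable ASCII only, so the payload itself can never contain POISON.
        payload = bytes(rng.randrange(0x20, 0x7F) for _ in range(rng.randrange(1, 32)))
        claimed_len = rng.randrange(0, 256)  # deliberately unrelated to len(payload)
        reply = handler(arena, payload, claimed_len)
        if POISON in reply:  # step three: look in the reply for the pattern
            leaks.append((payload, claimed_len))
    return leaks
```

Calling `fuzz(buggy_echo)` returns the requests whose replies contained the pattern, while `fuzz(fixed_echo)` returns an empty list. A real version would have to hook the actual allocator and speak the actual protocol, but the shape of the check is the same.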

This can be automated! This could be mandated as a procedure to be gone through before the code is released. And doing this would prevent this kind of bug from ever happening again. And if it did happen again, we could expand our testing processes to include that condition, and then that couldn't happen again.
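
As a rough sketch of what "mandated before release" might look like, here is a minimal checklist runner. Everything in it is an assumption made for illustration: the `check` decorator, the `run_checklist` function, and the tiny `echo` function under test are hypothetical, not part of any existing tool.

```python
CHECKS = []  # the checklist: (description, function) pairs, in the order added


def check(description):
    """Register a function as one item on the release checklist."""
    def register(func):
        CHECKS.append((description, func))
        return func
    return register


def echo(payload, claimed_len):
    """Hypothetical function under test: echo back at most len(payload) bytes."""
    return payload[:claimed_len]


@check("echo never returns more bytes than it was sent")
def echo_respects_payload_length():
    return len(echo(b"hi", 64)) <= 2


@check("copying a list yields an independent list")
def copy_is_independent():
    original = [1, 2, 3]
    duplicate = list(original)  # a real copy, not a second reference
    duplicate.append(4)
    return original == [1, 2, 3]


def run_checklist(checks=None):
    """Run every item; return the descriptions of the items that failed."""
    failures = []
    for description, func in (CHECKS if checks is None else checks):
        try:
            passed = bool(func())
        except Exception:
            passed = False  # a crashing check counts as a failed item
        if not passed:
            failures.append(description)
    return failures
```

A release script could refuse to ship unless `run_checklist()` returns an empty list. When a new kind of bug turns up, the lesson is recorded by adding one more `@check` function, and it stays on the list from then on.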

Imagine taking all the lessons we've learned over the past twenty or thirty or forty years and having them in an automated checklist. Buffer overruns, off-by-one errors, uninitialised variables. Already compilers warn you if there's a chance you used "=" when you meant "==". (If only JavaScript had a way of knowing if you meant "===" instead of "==".)

Shadowed methods and variables, constants over-written by aliased references, using a list instead of a generator, copying a reference to a list instead of copying the list.
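
To make one of these concrete, here is a small sketch of a single lesson encoded as an automated check: flagging variables and parameters that shadow Python's built-in names. It uses the standard `ast` and `builtins` modules; the function name `find_shadowed_builtins` and the sample code are invented for illustration, and a real linter would cover far more cases.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))


def find_shadowed_builtins(source):
    """Return sorted (line, name) pairs where a builtin name is rebound."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Assignments, loop targets and augmented assignments store to a Name.
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id in BUILTIN_NAMES:
                findings.append((node.lineno, node.id))
        # Function parameters appear as ast.arg nodes.
        elif isinstance(node, ast.arg) and node.arg in BUILTIN_NAMES:
            findings.append((node.lineno, node.arg))
    return sorted(findings)


SAMPLE = """\
def total(list):
    sum = 0
    for item in list:
        sum += item
    return sum
"""

for line, name in find_shadowed_builtins(SAMPLE):
    print(f"line {line}: '{name}' shadows a builtin")
```

On the sample this reports the parameter `list` on line 1 and the rebinding of `sum` on lines 2 and 4.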

How much knowledge has been lost because it wasn't captured by and encoded into automated tests?

How much of your knowledge is being wasted when it could be shared? Maybe you never make mistakes, but how are you helping others to avoid them?

Maybe people don't die when you make mistakes. Maybe you don't make mistakes. The surgeons claim not to make mistakes, and yet the evidence is irrefutable as to the value of the checklists.

What about you? Could you benefit from other people's captured, encoded, formalised knowledge? Are you willing to learn from and never repeat other people's mistakes?

Are you willing to share?

How can we make this happen?

Can we make this happen?


You should follow me on twitter @ColinTheMathmo



Site hosted by Colin and Rachel Wright:
  • Maths, Design, Juggling, Computing,
  • Embroidery, Proof-reading,
  • and other clever stuff.
