Graceful Degradation - 2014/12/12
I first learned about graceful degradation from a colleague. He prefaced his story by saying that good people learn from their mistakes, but the best people learn from other people's mistakes. This is a bit like the saying in aviation circles that a good landing is one you can walk away from, an excellent landing is when they can use the 'plane again ...
But I digress. He told me how he learned about graceful degradation, and the message was clear. He was telling me about his mistake so I could learn from it. I wish more people were willing to make their mistakes public - the software industry would certainly benefit if it could learn from everyone's mistakes.
His story goes, in broad outline (I forget some of the details), like this.
He made a comfortable living from programming electronic tills. At the time these were programmed in assembly language, and the best output you could hope for during debugging was to see whether it opened the till or not. It was difficult work because there were no simulators, no higher-level languages, no debuggers, no single-steppers, no help at all, really. You wrote your code in assembler, transferred it to the machine, then tested it to see if it worked. Sometimes it did.
My colleague was amazing at this work, which is why he could get contracts in it any time he wanted, and the money was good. So he worked about six months of the year, then did whatever he wanted the rest of the time.
Except at Christmas. That was when one of his biggest mistakes would come back to haunt him.
A large department store in a major capital city used tills that ran his program. It was one of the most sophisticated systems around. Each till would keep a list of items sold, and then a central computer would interrogate each till in turn and find out how much had been sold of each item. It would then keep track of how many remained in stock, and would project when a re-ordering would be necessary. Re-ordering was not automatic, but staff would be alerted when stock was low on any items and decisions could be made.
They loved the system. Buffer stock could be reduced, which cut the store's overheads and made the entire store more responsive to consumer demand.
Except at Christmas.
You see, every Christmas the system, in essence, would simply stop. An entire floor of tills would stop responding, refusing to accept further purchases for an unpredictable length of time, and then suddenly start working again. There was no apparent reason, no apparent rhyme, and nothing they could do except call my colleague and get him to come in and "fix the problem."
But there was really nothing he could do either, even when he worked out what the problem was. And what was the problem?
Each till would keep its list of items sold, and then when asked, would dump the list to the central computer. Then it would go back to
Entire floors of tills would grind to a halt, waiting to download their data and restart. The floor would simply stop. Not good at the busiest shopping time of the year. Really not good.
The patch that was run at Christmas was to make each till run slightly slow. There was still the occasional "I'm full so I'm stopping" moment, but in general the system would limp along and not exhibit the catastrophic failure of entire floors stopping. The difficulty was in judging exactly how slow to make the tills run. He also made the central computer contact the tills in a clever, pseudo-random way, and that meant that if one till stopped on a floor, the others would probably still be working.
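The details of that "clever, pseudo-random" polling scheme are not recorded here, so the following is only a minimal sketch of the general idea, in Python rather than the original assembler. The function name `polling_order`, the till identifiers, and the use of a plain shuffle are all assumptions made for illustration. The point is that a shuffled order spreads each floor's tills across the polling round, so tills on the same floor are unlikely to fill up at the same moment.

```python
import random
from typing import Dict, List


def polling_order(tills_by_floor: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Return every till exactly once, in a pseudo-random order.

    Hypothetical helper: the original system's scheme is unknown.
    Shuffling mixes the floors together, so one floor's tills are not
    all polled (and therefore not all left waiting) at the same time.
    """
    # Flatten the per-floor lists into one list of till identifiers.
    all_tills = [till for tills in tills_by_floor.values() for till in tills]
    # Shuffle in place using the supplied generator so runs are repeatable.
    rng.shuffle(all_tills)
    return all_tills


# Made-up store layout, purely for illustration.
store = {
    "ground": ["G1", "G2", "G3"],
    "first": ["F1", "F2", "F3"],
}
order = polling_order(store, random.Random(2014))
```

Passing in a seeded `random.Random` keeps the order reproducible for testing while still varying it from round to round in normal use.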
So what should have been done?
Well, apart from having a faster communications system that could deal with the load being placed on it, each till could notice when its queue was 80% full and start slowing down. Then the performance of the system as a whole would degrade gracefully. Sales would slow slightly, evenly across the store, and come to match the processing throughput.
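The 80% rule can be sketched in a few lines. This is an illustration under stated assumptions, not what the tills actually ran: the name `throttle_delay`, the linear ramp, and the two-second maximum delay are all invented for the example. Only the 80% threshold comes from the story.

```python
def throttle_delay(queued: int, capacity: int,
                   threshold: float = 0.8, max_delay: float = 2.0) -> float:
    """Seconds a till should pause before accepting the next sale.

    Hypothetical policy: no delay until the queue is `threshold` full,
    then a delay that grows linearly up to `max_delay` at 100% full.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    # Fraction of the queue in use, clamped to the range 0..1.
    fill = min(max(queued, 0), capacity) / capacity
    if fill <= threshold:
        return 0.0  # plenty of room: run at full speed
    # Scale the excess over the threshold onto 0..max_delay.
    return max_delay * (fill - threshold) / (1.0 - threshold)
```

With a queue of 100 entries, a till holding 50 sales would not pause at all, one holding 90 would pause for about a second per sale, and one that was completely full would pause for the full two seconds. Each till slows a little as it fills, rather than running flat out and then stopping dead.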
Of course it would have been better if the system had simply been faster, but when overloaded, degrading gracefully is usually a better option than simply stopping. Think about it next time you worry about system capacity.
Is it better to halt and clear the backlog, or degrade gracefully and continue to serve your customers?
I've decided no longer to include comments directly via the Disqus (or any other) system. Instead, I'd be more than delighted to get emails from people who wish to make comments or engage in discussion. Comments will then be integrated into the page as and when they are appropriate.
If the number of emails/comments gets too large to handle then I might return to a semi-automated system. That's looking increasingly unlikely.