Saturday, November 04, 2006

Is reliability harmful?

Over time I've become increasingly concerned about the 'acquisition' of non-critical technologies for life- and mission-critical purposes. I view this trend, and the lack of rational thought applied to it, as a serious problem in the use of technology within society.

The story is usually the same.

First comes the idea. It's usually a bright idea, often from a bunch of off-the-wall researchers. Implementations are concept demonstrators, built for research flexibility and speed of construction.

Second comes the commercial application. Often this looks disturbingly like the concept demonstrator; if we're lucky, it's a hardened implementation that has actually been tested.

Third comes the impact of network economics. The value of the product to its users grows as n^2 for n users (Metcalfe's law), which produces rapid, explosive growth; a toy sketch of this effect follows the fourth step below. During this time the technology 'crosses the chasm' into a serious product. However, it has to maintain backwards compatibility, and the golden handcuffs click into place. The technology's fate is all but sealed for many years (e.g. the 32-bit address space of IPv4).

Fourth comes acceptance. The technology becomes integrated into the way the society that uses it exists. We start relying on the technology, expecting it to be there, 24/7/365.
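As a toy illustration of the n^2 effect from the third step, here is a small Python sketch comparing Metcalfe-style value against a linear running cost. The cost model and the constants are my own arbitrary assumptions, there purely to show the shape of the curve, not a claim about any real network.

    # Toy Metcalfe's-law illustration: value scales with the number of
    # possible links, n*(n-1)/2, while cost is assumed linear in n.
    # All constants are arbitrary; only the shape of the growth matters.

    def network_value(n: int) -> float:
        """Metcalfe-style value: proportional to the n*(n-1)/2 possible links."""
        return n * (n - 1) / 2

    def running_cost(n: int, cost_per_user: float = 50.0) -> float:
        """Cost assumed linear in the user base (hypothetical constant)."""
        return n * cost_per_user

    for n in (10, 100, 1_000, 10_000):
        v, c = network_value(n), running_cost(n)
        print(f"users={n:>6}  value={v:>14,.0f}  cost={c:>10,.0f}  value/cost={v / c:8.1f}")

Once the value/cost ratio crosses one, every new user makes the network more attractive to the next, which is exactly the explosive phase described above.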

In phase 1, the system is so unreliable that it barely works outside the lab; there is no perceived threat. It's so difficult to get the darn thing working that no-one really cares about large-scale reliability.

In phase 2, it's a nascent technology, a cool gadget; no-one would rely on it.

In phase 3, OK, so it's a commercial technology; it works almost every time you turn it on. If it doesn't, people put in a lot of effort to make it work.

In phase 4, it's too late. Huge effort is poured into engineering reliability into the system. If we're unlucky, people die because of our reliance on the technology, but society cannot excise the technology even if it wanted to. You can tell people not to rely on it all you like; it will do no good.

So, without any serious consideration being given, we've gone from a research toy to something your life may depend on. If you're lucky, it might just be something that will inconvenience you or lose you money when it's not there.

Consider some examples:

Mild: Digital photography

Do you rely on your digital cameras? How about your computer? How would you feel if it all went away tomorrow? Are you sure you don't rely on digital photography? As far as I can tell, lots of people do.

More serious example:

When you dial 999 (UK) or 911 (US), do you expect anyone to answer the phone? Why? You're relying on a system with multiple single points of failure, but it's been around for a while, people understand that we need it, a serious amount of effort has gone into maintaining uptime, and there are various hacks that attempt to give emergency traffic priority.

Don't delude yourself: the availability of the emergency services is not guaranteed by the technology. You are relying on a high-reliability statistic, and that is all.

There are multiple ways to mount a denial-of-service attack on the emergency services; we can but thank our luck that none of them appear to have been seriously exploited, yet. This is based on evidence (which obviously cannot be discussed here), not mere speculation.

And this is a mature technology!

Mobile telephony was not intended for emergency-sensitive use. It was not supposed to carry life-critical traffic, and yet it now does. If anything, I suspect that people now rely on mobile phones more than landlines in many places.

People in Downing College complain loudly when the Internet is unavailable for 1.5 days in a month. Comments like "I can't do without it", "I feel lost", "I can't get on with my work". Estimates of the impact of an actual failure of the Internet include the economic collapse, within a week, of even tangentially related companies such as insurers.

Not bad, for a technology described by a networking expert as "a toy that no-one was ever supposed to actually use".

Consider also that most of the Internet runs on PCs, running Windows and UNIX, the very things whose reliability you worry about when they store your work/photos...

The same applies all over the place. Even the reliability-paranoid military have struggled to resist the appeal of consumer-grade products.

--

The same attitude applies to software.

I have worked on multiple projects that were not certified as life-critical, but because they almost never failed, people grew to depend on them. In the bad cases you hear things like "It's OK, they'll fall back to paper if it fails".

Even in the more optimistic cases, where backup systems exist, they go unused for years.

Even if we assume the optimistic case, that failures are randomly distributed (and so may not occur for years at a stretch): can you remember how to do something that you haven't done for 5 years? Can you remember how to manage styles in MS Word 2.0 without looking?

Can you remember how OLE embedding worked in Windows 3.0?

And this is the ideal situation. Realistically, during failures you're likely to be working under stress, possibly in a hostile, noisy, distracting environment. Experiments show that humans get almost everything that isn't purely internalised wrong in such circumstances (e.g. Three Mile Island).

My point is this:

"This technology is not certified for life critical use" is an exercise in blame management, it is technically and sociologically vacuous.

And the implication for design?

Should we deliberately design our systems to fail periodically, to make sure they're not being relied on? I gather Tescos does this: every year they run an unscheduled, real disaster test by pulling certain power cables.
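To make the question concrete, here is a minimal Python sketch of deliberate failure injection, in the spirit of the cable-pulling test above. The decorator, the exception, and the one-in-a-thousand rate are all hypothetical choices of mine, not a description of anyone's actual practice; the point is simply that the fallback path gets exercised.

    import functools
    import random

    class InjectedFailure(RuntimeError):
        """Raised on purpose so callers must keep their fallback alive."""

    def occasionally_fail(rate: float = 0.001):
        """Decorator that makes a dependency fail, rarely and visibly."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if random.random() < rate:
                    raise InjectedFailure(f"failure drill in {func.__name__}")
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @occasionally_fail(rate=0.001)  # roughly 1 call in 1,000 fails on purpose
    def lookup_record(key: str) -> str:
        # Stand-in for the "reliable" service people have come to depend on.
        return f"record for {key}"

    def lookup_with_fallback(key: str) -> str:
        try:
            return lookup_record(key)
        except InjectedFailure:
            # The fall-back-to-paper procedure, kept alive precisely
            # because it is still needed every so often.
            return f"manual lookup for {key}"

A system that fails visibly once in a while at least keeps its users honest about what they are relying on.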

I think that we should at least think about it.....