Saturday, November 04, 2006

Is reliability harmful?

Over time I've become increasingly concerned about the 'acquisition' of non-critical technologies for life/mission-critical purposes. I view this trend, and the lack of rational thought applied to it, as a serious problem in the way society uses technology.

The story is usually the same.

First comes the idea. It's usually a bright idea, often from a bunch of off-the-wall researchers. The implementations are concept demonstrators, developed for research flexibility and rapid prototyping.

Second comes the commercial application. Often this looks disturbingly like the concept demonstrator; if we're lucky, it's a hardened implementation that has actually been tested.

Third comes the impact of network economics. The value of the product to its users grows roughly as n² for n users (Metcalfe's law). This results in rapid, explosive growth. During this time the technology 'crosses the chasm' and becomes a serious product. However, it has to maintain backwards compatibility; the golden handcuffs click into place. The technology's fate is all but sealed for many years (e.g. the address space of IP).

Fourth comes acceptance. The technology becomes integrated into the fabric of the society that uses it. We start relying on the technology, expecting it to be there 24/7/365.

In phase 1, the system is so unreliable that it barely works outside the lab; there is no perceived threat. It's so difficult to get the darn thing working at all that no-one really cares about large-scale reliability.

In phase 2, it's a nascent technology, a cool gadget; no-one would rely on it.

In phase 3, OK, so it's a commercial technology; it works almost every time you turn it on. If it doesn't, people put in a lot of effort to make it work.

In phase 4, it's too late. Huge effort is poured into engineering reliability into the system. If we're unlucky, people die because of our reliance on the technology, but society cannot excise it even if it wanted to. You can tell people not to rely on it all you like; it will do no good.

So, without any serious consideration being given, we've gone from a research toy to something your life may depend on; if you're lucky, it might just be something that will inconvenience you or lose you money when it's not there.

Consider some examples:

Mild: Digital photography

Do you rely on your digital cameras? How about your computer? How would you feel if it all went away tomorrow? Are you sure you don't rely on digital photography? As far as I can tell, lots of people do.

A more serious example:

When you dial 999 (UK) or 911 (US) do you expect anyone to answer the phone? Why? You're relying on a system with multiple single points of failure, but it's been around for a while, people understand that we need it, a serious amount of effort has gone into maintenance of uptime, and there are various hacks which attempt to give emergency traffic priority.

Don't delude yourself, the availability of the emergency services is not guaranteed by the technology, you are relying on a high-reliability statistic, and that is all.

There are multiple ways for people to mount a denial-of-service attack on the emergency services; we can but thank our luck that none of them appear to have been seriously exploited, yet. This is based on evidence (which obviously cannot be discussed here), not mere speculation.

And this is a mature technology!

Mobile telephony was not intended for emergency-sensitive use. It was not supposed to carry life-critical traffic, and yet it now does. If anything, I suspect that people rely on mobile phones more than landlines in many places now.

People in Downing College complain loudly when the Internet is unavailable for 1.5 days in a month, with comments like "I can't do without it", "I feel lost", "I can't get on with my work". Estimates of the impact of an actual failure of the Internet include the economic collapse, within a week, of even tangentially related companies such as insurers.

Not bad, for a technology described by a networking expert as "a toy that no-one was ever supposed to actually use".

Consider also that most of the Internet runs on PCs running Windows and UNIX, the very things whose reliability you worry about when they store your work and photos...

The same applies all over the place. Even the reliability-paranoid military have struggled to resist the appeal of consumer-grade products.

--

The same attitude applies to software.

I have worked on multiple projects that were not certified as life-critical, but because they almost never failed, people grew to depend on them. In the bad cases you hear things like "It's OK, they'll fall back to paper if it fails".

Even in the more optimistic cases, where backup systems exist, they go unused for years.

Even if we assume the optimistic case, that failures are randomly distributed, can you remember how to do something that you haven't done for 5 years? Can you remember how to manage styles in MS Word 2.0 without looking?

Can you remember how OLE embedding worked in Windows 3.0?

And this is the ideal situation. Realistically, during a failure you're likely to be working under stress, possibly in a hostile, noisy, distracting environment. Experiments show that in such circumstances humans get almost everything that isn't purely internalised wrong (e.g. Three Mile Island).

My point is this:

"This technology is not certified for life critical use" is an exercise in blame management, it is technically and sociologically vacuous.

And the implication for design?

Should we deliberately design our systems to fail periodically, to make sure they're not being relied on? I gather Tesco does something like this: every year they run an unscheduled, real disaster test by pulling certain power cables.
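To make the question concrete, here is a minimal sketch (in Python, purely illustrative; the wrapper, the failure rate and the toy lookup service are all my own inventions, not anything Tesco or anyone else actually runs) of what deliberately engineered unreliability might look like:

```python
import random

class DeliberatelyUnreliable:
    """Wrap a service and refuse to answer a small fraction of the time,
    so that callers are forced to keep their fallback procedures alive."""

    def __init__(self, service, failure_rate=0.01, rng=None):
        self.service = service
        self.failure_rate = failure_rate  # e.g. deliberately fail ~1% of calls
        self.rng = rng or random.Random()

    def call(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            # Simulated outage: the caller must fall back to the manual process.
            raise RuntimeError("scheduled unreliability: use the fallback procedure")
        return self.service(*args, **kwargs)

if __name__ == "__main__":
    lookup = DeliberatelyUnreliable(lambda name: f"record for {name}", failure_rate=0.2)
    for person in ["alice", "bob", "carol", "dave", "eve"]:
        try:
            print(lookup.call(person))
        except RuntimeError as exc:
            print(f"{person}: {exc}")
```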

I think that we should at least think about it.....

Tuesday, September 05, 2006

I finally understand

It's 01:23 in the morning, and I'm sitting waiting for my compiler to finish the latest test case in my attempt to work out why my application randomly quits when I modify a CPU flag.

Until tonight, I didn't understand why people complained about the speed of C++ compilers.

I now get it.

Friday, August 18, 2006

Ten years of Cognitive Dimensions

The special Cognitive Dimensions issue of JVLC has been published. :-)

I have a co-authored paper in it with Thomas Green, Ann Blandford, Chris Roast and Steven Clarke.

Full text available here.

(www.sciencedirect.com/science/journal/1045926X)

In it, I argue for the separation of cognitive and information-structural concerns in the Cognitive Dimensions framework, and use the resultant 'refactored' dimension set to bring new clarity to some complex cognitive problems.

The work has directly influenced the cognitive modelling of security that I'm currently writing up. I believe it offers a powerful view, especially for modelling security-critical user interfaces.

Thanks to my co-authors and Alan Blackwell.

Thursday, July 13, 2006

When you're standing in deep water

About 18 months ago, around the 10th of Jan, 2005, I experienced what I would now euphemistically describe as an 'unmanaged data loss incident'. My secondary hard disk died, whilst copying data to a DVD. I lost ~1000 photos. A healthy dose of black magic and many tense hours with a data recovery tool resurrected ~95% of the data.

I learnt many lessons that day, several of which have been repeatedly confirmed over the intervening year and a half. This year, as a student IT person, I have seen several upset people with failed machines.

To summarise the lessons learnt from this and from security assurance work:

Backup the right things

People back up what they think they would be upset to lose. This is often the wrong thing. People back up things they perceive as having financial value, such as work, but when the data goes away, they care more about their photos.

Things with emotional value require backups as well.

Make it easy

If your backups aren't quick and easy to do, you won't do them. We see this in security time and time again: if security features get in the way, they're going to get 'temporarily disabled'. It's amazing how long 'temporarily' can be.

Viscosity is your worst enemy. I will have more to say about this another day.

Manage the data volumes

'But I've got so many photos, I can't back them up.'

Tell me about it. I currently have 205GB of original and derived files, increasing at 20-50GB a month, depending on the number of jobs I do and how much time I have.

I've been using a split Originals | Processed structure, with the obvious (to a pre-metadata-age compsci) approach of putting each shoot in a folder within a hierarchical Year – Month – Date & Title layout. Backups were run from a list of the folders held on each incrementally numbered DVD. However, this approach is no longer sufficient:

-> I often have folders which exceed 4.7GB.

-> I now shoot with multiple cameras, with intersecting namespaces.

-> My file names have started wrapping around, making filenames non-unique even within a single camera.

-> Finding files by date is becoming increasingly difficult.

-> Synchronising derived files in backups is effectively impossible. I am creating derived files from originals that are over a year old, which means tracing which DVD backups are now out of date and replacing them, or splitting them across multiple DVDs. This is inefficient in terms of time and the volume of media consumed. Restoration after a serious fire would be terrible at the moment, probably involving 100+ hours of work.

So, it's time to migrate to a metadata-driven workflow. This is my first use of metadata in anger, and I'm concerned as to how well it will work. The basic idea is:

Camera -> Download folders. (Fast temp backup to secondary disk)
Merge/Sort -> Task folder (Merge based on date-time-serial, to prevent file name collisions between different cameras)
Working folder backup (Network drive temp backup)

--
Rate, Apply Metadata, Apply Camera Raw settings
--

Split RAW into incremental ‘data buckets’, each of which will fit on a DVD
Create DNG files from RAW, split to DNG buckets
Burn RAW buckets to DVD (Store offsite)
Burn DNG buckets to DVD (Store at a separate offsite location)
Move DNG into archive or process to derived files

There are various other complications, but that's the general idea. Details and reflections will follow later. The system is currently in preliminary testing prior to a full migration.
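To make the mechanical steps above a little more concrete, here is a rough Python sketch of the merge/rename step (keyed on date-time plus a serial, to avoid collisions between cameras) and the greedy split into DVD-sized buckets. This is purely illustrative, assumes my own placeholder folder layout, and is not the tooling I'm actually running:

```python
import shutil
from pathlib import Path
from datetime import datetime

DVD_CAPACITY = 4_700_000_000  # ~4.7GB per single-layer DVD

def merge_and_rename(download_dirs, task_dir):
    """Copy files from per-camera download folders into one task folder,
    renaming each to date-time-serial so names never collide."""
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    serial = 0
    for folder in download_dirs:
        for src in sorted(Path(folder).iterdir()):
            if not src.is_file():
                continue
            stamp = datetime.fromtimestamp(src.stat().st_mtime)
            serial += 1
            dest = task_dir / f"{stamp:%Y%m%d-%H%M%S}-{serial:05d}{src.suffix.lower()}"
            shutil.copy2(src, dest)

def split_into_buckets(files, capacity=DVD_CAPACITY):
    """Greedily pack files (in name order, i.e. date order) into
    incremental buckets, each of which will fit on one DVD."""
    buckets, current, used = [], [], 0
    for f in sorted(files):
        size = Path(f).stat().st_size
        if current and used + size > capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(f)
        used += size
    if current:
        buckets.append(current)
    return buckets
```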

Manage obsolescence

Obvious: DVDs decay, get bleached, and suffer heat/moisture exposure, so be prepared to regenerate them by reburning.

Not so obvious: JPEG files will probably still be readable in 5 years, possibly even 15. I would not like to say the same about Canon CR2 or Sony SRF files. Hence I'm using DNG, which I think, if the worst came to the worst, I could write a parser for.

However, you should plan to systematically migrate data to new formats if you want to be able to painlessly access your data in a decade.
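On the reburning point, a minimal sketch of the sort of check that tells you when a disc is decaying: record a manifest of SHA-256 checksums at burn time, then periodically re-read the disc and compare. The manifest format and layout here are my own assumptions, purely illustrative:

```python
import hashlib
import json
from pathlib import Path

def checksum(path, chunk=1 << 20):
    """SHA-256 of a file, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(disc_root, manifest_path):
    """Record a checksum for every file on the disc at burn time."""
    root = Path(disc_root)
    manifest = {str(p.relative_to(root)): checksum(p)
                for p in root.rglob("*") if p.is_file()}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify(disc_root, manifest_path):
    """Return the files that are missing or no longer match; if any,
    the disc is decaying and should be regenerated from another copy."""
    root = Path(disc_root)
    manifest = json.loads(Path(manifest_path).read_text())
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.exists() or checksum(p) != expected:
            bad.append(rel)
    return bad
```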

Manage security implications

Several issues here. One that is globally relevant, others that may be less so.

Attacks on a local machine, be they viruses or something more serious, can cause soft damage to files. These may go undetected and propagate into files which are then backed up, so instead of backing up the data, you are backing up the corruption and the virus. User error may cause similar problems.

[Removes rant on how people who ignore user error in managing backups should be banned from working in the area :-)]

This problem can be mitigated by using 'immutable media' such as DVDs, which can't practically be changed. However, that only protects the data from the moment the DVD is written. Good general computing practice is also required: run as a low-privilege user, run machine-protection software, etc.

And then there's this problem's big nasty brother: confidentiality. For many, this will be less of a problem, but I manage a lot of confidential and protected information. It's no use making data accessible in the event of a serious problem if all an attacker needs to do to get at the data is intercept one of these streams, compromise a machine, or force a system into a recovery mode. Backup/failover systems often interact with security in a seriously bad way. I will have more to say on this in the future as well.

I am not going to discuss my solution to this, but suffice to say, it is amongst the hardest of these problems, calling for a sophisticated network protection architecture, strong applications of crypto to move trust around, and aggressive audit logging.

Disaster Recovery Simulations

I don’t trust anything until I’ve tested it. I’m planning my first major disaster recovery test shortly after the migration completes later this month. Time to find my mistakes in a controlled manner :-)

Threat model

But, having said all this, be realistic: understand the threats and the potential harm, and choose sensible solutions. There's no point in spending all your time doing backups rather than producing work to back up.

I have multiple scenarios and goals, the non-security threat model is:

Threat: Single hardware failure (e.g. Disk crash)
Maximum accepted loss: 24 hours work

Threat: Location failure (e.g. Fire, flood, lightning)
Maximum accepted loss: 7 days work

I consider multiple location failure to be a security problem, not a reliability one.
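For what it's worth, the 'maximum accepted loss' figures above are easy to check mechanically. A sketch (the backup paths are made up; the thresholds mirror the table above) that warns when the newest backup in each tier is older than the accepted window:

```python
import time
from pathlib import Path

# Thresholds mirror the threat model above: 24 hours for a single hardware
# failure (local backup), 7 days for a location failure (offsite backup).
TIERS = {
    "local":   {"path": Path("/backups/local"),   "max_age_hours": 24},
    "offsite": {"path": Path("/backups/offsite"), "max_age_hours": 7 * 24},
}

def newest_mtime(folder):
    files = [p for p in folder.rglob("*") if p.is_file()]
    return max((p.stat().st_mtime for p in files), default=None)

def check():
    now = time.time()
    for name, tier in TIERS.items():
        latest = newest_mtime(tier["path"])
        if latest is None:
            print(f"{name}: no backups found at all!")
            continue
        age_hours = (now - latest) / 3600
        status = "OK" if age_hours <= tier["max_age_hours"] else "STALE"
        print(f"{name}: newest backup is {age_hours:.1f}h old ({status})")

if __name__ == "__main__":
    check()
```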

So that's the idea; now we'll see how well it goes.

The estimated initial cost (ignoring the time and ongoing maintenance costs(!!!!)) of all this is:

If the photos are kept unprocessed, 60p per GB of originals (CR2)
If the photos are processed, £3 per GB of originals (CR2)
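As a rough worked example of those figures, combined with the 20-50GB a month growth rate quoted earlier in this post (an illustration only):

```python
# Rough monthly media cost from the per-GB figures above, using the
# 20-50GB/month growth rate quoted earlier in this post.
UNPROCESSED_PER_GB = 0.60  # £ per GB of originals kept unprocessed
PROCESSED_PER_GB = 3.00    # £ per GB of originals that get processed

for gb_per_month in (20, 50):
    low = gb_per_month * UNPROCESSED_PER_GB
    high = gb_per_month * PROCESSED_PER_GB
    print(f"{gb_per_month} GB/month: £{low:.0f} (unprocessed) to £{high:.0f} (processed)")
```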

I thought the film was supposed to be free with digital cameras?

Wednesday, June 14, 2006

MetaBlogging

Inspired by the sterling examples provided by others, and further others, I have finally made a fledgling entry into the world of blogging.

In a moment of revision-induced boredom, combined with the guilt of ignoring Marc Eisenstadt for too long, and with my replies to others' blogs becoming as long as blog posts in themselves, I decided, reluctantly, that I would blog.

From there it became a question of choosing a provider. Could I handcode it myself?

Rejected, out of laziness, incompetence and a general admission that I'm very afraid of writing web software. Besides, it seems like a very economically dubious thing to do...

The idea of letting other people run bits of my website for me is still somewhat alien. Despite my generally positive and always professional experiences over at Smugmug, it is still with some trepidation, not to mention outright terror, that I let someone else control a fraction of a website that contains my content.

WordPress seemed nice, but inflexible unless I hosted it myself. My hosting infrastructure is all set up for ASP.NET, not PHP, and again my awareness of my own ignorance, and a general reluctance to spend yet more on hosting, counted against this.

So Blogger it is, for now at least...

This whole thing seems unclear to me:

  • If I do blog, will anyone read?

  • Can I blog about most of what I find interesting?

  • Do I wish to blog about much of what I find interesting?

  • How does blogging fit in between my carefully managed opinions for the world, and the ranting discourses that my friends kindly suffer?

  • How will a blog cope with being interspersed with comments on garden parties and discussions of information-efficient communication by SMS?

  • Blah....



A Brave New World indeed....