Thursday, July 13, 2006

When you're standing in deep water

About 18 months ago, around the 10th of January 2005, I experienced what I would now euphemistically describe as an 'unmanaged data loss incident'. My secondary hard disk died whilst I was copying data to a DVD. I lost ~1000 photos. A healthy dose of black magic and many tense hours with a data recovery tool resurrected ~95% of the data.

I learnt many lessons that day, several of which have been repeatedly confirmed over the intervening year and a half. This year, as a Student IT person, I have seen several upset people with failed machines.

To summarise the lessons learnt from this incident and from security assurance work:

Back up the right things

People back up what they think they would be upset to lose. This is often the wrong thing. People back up things they perceive as having financial value, such as work, but when the data actually goes away, they care more about their photos.

Things with emotional value require backups as well.

Make it easy

If your backups aren’t quick and easy to do, you won’t do them. We see this in security time and time again: if security features get in the way, they’re going to get ‘temporarily disabled’. It’s amazing how long ‘temporarily’ can be.

Viscosity is your worst enemy. I will have more to say about this another day.

Manage the data volumes

‘But I’ve got so many photos, I can’t back them up’

Tell me about it. I currently have 205GB of original and derived files, increasing at 20-50GB a month, depending on the number of jobs I do and how much time I have.

I’ve been using a split Originals | Processed structure, with the obvious (to a pre-metadata-age CompSci) approach of putting each shoot in a folder with a hierarchical name (Year – Month – Date & Title). Backups were run from a list of the folders held on each incrementally numbered DVD. However, this approach is no longer sufficient:

-> I often have folders which exceed 4.7GB (more than a single-layer DVD holds).

-> I now shoot with multiple cameras, with intersecting namespaces.

-> My file names have started wrapping around, making filenames non-unique, even from the same camera.

-> Finding files by date is becoming increasingly difficult.

-> Synchronising derived files in backups is effectively impossible. I am creating derived files from originals that are over a year old, which means tracing which DVD backups are now out of date and replacing them, or splitting the derived files across additional DVDs. This is inefficient in terms of time and media consumed. Restoration after a serious fire would be terrible at the moment, probably involving 100+ hours of work.

So, it’s time to migrate to a Metadata Driven Workflow. This is my first use of metadata in anger, and I’m concerned about how well it will work. The basic idea is:

Camera -> Download folders. (Fast temp backup to secondary disk)
Merge/Sort -> Task folder (Merge based on date-time-serial, to prevent file name collisions from different cameras; see the first sketch below)
Working folder backup (Network drive temp backup)

--
Rate, Apply Metadata, Apply Camera Raw settings
--

Split RAW into incremental ‘data buckets’, each of which will fit on a DVD (see the second sketch below)
Create DNG files from RAW, split to DNG buckets
Burn RAW buckets to DVD (Store offsite)
Burn DNG buckets to DVD (Store at a separate offsite location)
Move DNG into archive or process to derived files

There are various other complications, but that’s the general idea. Details and reflections will be given later. The system is currently in preliminary testing prior to a full migration.
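
To make the Merge/Sort step concrete, here is a rough sketch in Python of the kind of renaming I have in mind. It is illustrative only: the folder layout, the naming pattern, and the use of the file’s modification time as a stand-in for the EXIF capture time are assumptions, not the finished tool.

```python
# Sketch of the Merge/Sort step: pull files from per-camera download
# folders into one task folder, renamed on a date-time-serial scheme so
# files from different cameras (or wrapped counters) can never collide.
import datetime
import os
import shutil

def capture_time(path):
    # Stand-in: a real version would read EXIF DateTimeOriginal from the
    # RAW file rather than trusting the filesystem modification time.
    return datetime.datetime.fromtimestamp(os.path.getmtime(path))

def merged_name(path, when, serial):
    # e.g. 20060713-154210-0042.cr2 -- date, time, then a running serial.
    ext = os.path.splitext(path)[1].lower()
    return "%s-%04d%s" % (when.strftime("%Y%m%d-%H%M%S"), serial, ext)

def merge(download_dirs, task_dir):
    serial = 0
    for d in download_dirs:
        for name in sorted(os.listdir(d)):
            src = os.path.join(d, name)
            if not os.path.isfile(src):
                continue
            serial += 1
            dst = os.path.join(task_dir, merged_name(src, capture_time(src), serial))
            assert not os.path.exists(dst)  # collisions should now be impossible
            shutil.copy2(src, dst)          # copy, don't move, until verified
```

A production version would also verify each copy before the download folders are cleared.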
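
And a second sketch, this time of splitting a set of files into DVD-sized ‘data buckets’. The 4.7GB capacity figure, the bucket-NNN naming and the greedy first-fit packing are assumptions for illustration, not necessarily how the final system will pack discs.

```python
# Sketch: assign each file to the first bucket with enough space left,
# creating a new incrementally numbered bucket directory when needed.
import os
import shutil

DVD_CAPACITY = 4700 * 1000 * 1000  # nominal single-layer DVD, in bytes

def split_into_buckets(files, bucket_root, capacity=DVD_CAPACITY):
    buckets = []  # each entry: [directory, bytes used so far]
    for src in sorted(files):
        size = os.path.getsize(src)
        target = None
        for b in buckets:
            if b[1] + size <= capacity:
                target = b
                break
        if target is None:
            path = os.path.join(bucket_root, "bucket-%03d" % (len(buckets) + 1))
            os.makedirs(path)
            target = [path, 0]
            buckets.append(target)
        shutil.copy2(src, os.path.join(target[0], os.path.basename(src)))
        target[1] += size
    return [b[0] for b in buckets]
```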

Manage obsolescence

Obvious: DVDs decay, get bleached, suffer heat/moisture exposure. So be prepared to regenerate them by reburning.
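
One way the reburn decision could be automated (a sketch, not my actual tooling): write a checksum manifest for each bucket at burn time, then periodically verify discs against it; any mismatch means the disc is decaying and should be regenerated from another copy. The manifest name and layout here are made up for illustration.

```python
# Sketch: per-bucket checksum manifest, written at burn time and checked
# later to catch disc decay before it spreads into the backups.
import hashlib
import os

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(bucket_dir):
    with open(os.path.join(bucket_dir, "manifest.sha1"), "w") as out:
        for name in sorted(os.listdir(bucket_dir)):
            if name != "manifest.sha1":
                out.write("%s  %s\n" % (file_sha1(os.path.join(bucket_dir, name)), name))

def verify(disc_dir):
    """Return the list of files whose checksum no longer matches."""
    bad = []
    with open(os.path.join(disc_dir, "manifest.sha1")) as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            if file_sha1(os.path.join(disc_dir, name)) != digest:
                bad.append(name)
    return bad
```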

Not so obvious: JPEG files will probably still be readable in 5 years, possibly even 15. I would not like to say the same about Canon CR2 or Sony SRF files. Hence I’m using DNG, which I think, if the worst came to the worst, I could write a parser for.

However, you should plan to systematically migrate data to new formats if you want to be able to painlessly access your data in a decade.

Manage security implications

Several issues here. One that is globally relevant, others that may be less so.

Attacks on a local machine, be they viruses or something more serious, can cause soft damage to files. The damage may go undetected and propagate into files which are then backed up, so instead of backing up the data, you are backing up the corruption and the virus. User error can cause similar problems.

[Removes rant on how people who ignore user error in managing backups should be banned from working in the area :-)]

This problem can be mitigated by using ‘immutable media’ such as DVD, which can’t practically be changed. However, this only protects from the point at which the DVD is written; anything corrupted before burning stays corrupted. Good general computing practice is also required: run as a low-privilege user, run machine protection software, etc.

And then there’s this problem’s big nasty brother: confidentiality. For many, this will be less of a problem, but I manage a lot of confidential and protected information. It’s no use making data accessible in the event of a serious problem if all an attacker needs to do to get the data is intercept one of these streams, or compromise a machine, or force a system into a recovery mode. Backup/failover systems often interact with security in a seriously bad way. I will have more to say on this in the future as well.

I am not going to discuss my solution to this, but suffice to say it is amongst the hardest of these problems, calling for a sophisticated network protection architecture, strong applications of crypto to move trust around, and aggressive audit logging.

Disaster Recovery Simulations

I don’t trust anything until I’ve tested it. I’m planning my first major disaster recovery test shortly after the migration completes later this month. Time to find my mistakes in a controlled manner :-)

Threat model

But, having said all this, be realistic: understand the threats and the potential harm, and choose sensible solutions. There’s no point in spending all your time doing backups rather than producing work to back up.

I have multiple scenarios and goals; the non-security threat model is:

Threat: Single hardware failure (e.g. Disk crash)
Maximum accepted loss: 24 hours work

Threat: Location failure (e.g. Fire, flood, lightning)
Maximum accepted loss: 7 days work

I consider multiple location failure to be a security problem, not a reliability one.

So that’s the idea; now we’ll see how well it goes.

The estimated initial cost (ignoring the time and ongoing maintenance costs(!!!!)) of all this is:

If the photos are kept unprocessed, 60p per GB of originals (CR2)
If the photos are processed, £3 per GB of originals (CR2)

At 20-50GB of originals a month, that works out to roughly £12-£30 a month in media if left unprocessed, or £60-£150 a month once processed.

I thought the film was supposed to be free with digital cameras?