Monday, October 12, 2009

Backup failures on small servers

In light of a catastrophic data loss at T-Mobile, I went to verify the backup setups I am responsible for and found out that one of them wasn't working as expected. Damn it. That's what happens when you have very few resources for a custom solution.

This Slashdot comment on the T-Mobile issue is particularly relevant:

As a wise auditor once told me: You can outsource the work, but you can not outsource the responsibility.

If your data is important to you - you must back it up, and you must test your backups.


But how can you say that to a customer when the customer is small, not specialized in IT, and has a low budget? Here's part of my mail to them.

Hi,

I went to check the backup solution and found out it hadn't been working properly for a while. I fixed it. [...]

So I wanted to remind you that the backup solution I have implemented is very simple and not foolproof. The main reasons for this are:
* I rely on other parties (the hosting company)
* risk management is also a question of resources, and I've allocated the minimum of resources to this setup
* backup isn't my specialty, even though I know my share

In theory one should verify backups often, but I've had to make a tradeoff between costs and results. I verify the backups once in a while, but I haven't put a solution in place that notifies me automatically if something goes really wrong.

In case of a total disaster where we have lost the system but not the data, fully reinstalling the system might take up to 2 days.
In the worst case, we could also lose the latest backups, e.g. if the backups I have now are somehow corrupted, or if the hosting company has failures in its own backups.
The risk is low, but it can always happen...

It would be great if you could assess the consequences for your company if you lost the latest system data, and think of the following questions:
* how much data you are prepared to lose in the worst case
* how long you can stay without a running system

I can then review and see if I should verify the quality of the backup setup more often.

Cheers,

Jerome
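
The automatic notification the mail admits is missing does not have to be expensive. Here is a minimal sketch in Python of a freshness check, assuming the backups land as plain files in a single directory. The function name `check_backups`, the 26-hour threshold and the directory layout are hypothetical and not part of the actual setup; it only catches a backup job that stopped running or produced an empty file, and it is no substitute for a real restore test.

```python
import time
from pathlib import Path


def check_backups(backup_dir, max_age_hours=26.0, now=None):
    """Return (ok, message) for the newest file in backup_dir.

    Hypothetical helper: flags a missing, empty, or stale backup.
    It does not verify that the backup can actually be restored.
    """
    now = time.time() if now is None else now

    # Only look at regular files directly inside the backup directory.
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return False, "no backup files found"

    # The most recently modified file is taken to be the latest backup.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    info = newest.stat()
    age_hours = (now - info.st_mtime) / 3600.0

    if info.st_size == 0:
        return False, f"{newest.name} is empty"
    if age_hours > max_age_hours:
        return False, f"{newest.name} is {age_hours:.1f} hours old"
    return True, f"{newest.name} is {age_hours:.1f} hours old"
```

A small wrapper could call `check_backups` once a day and send a mail whenever the first returned value is `False`. The 26-hour default assumes a nightly backup with a little slack; it should follow from the customer's answer to the "how much data can you lose" question.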


To summarize: plan for the worst, hope for the best! And be pragmatic.
