Building Your Plan B - How to Build a Plan for When Your Systems Implode

For organisations that rely on accurate real-time data, having a disaster recovery plan is essential. That doesn’t need debating – the impact from a systems outage on your customers, staff and revenue is too great to ignore. Stakeholders expect you to recover your systems fast but they also expect pro-active disaster mitigation initiatives to fit within a budget. This leaves IT managers with the challenge of cost-effectively protecting systems from a wide range of risks, while being resourced for the best possible recovery outcome.

You can fall into the trap of protecting yourself from the last disaster you had. Once an outage has been experienced, it’s clear where the risks are and what measures are needed to prevent it in the future. Unfortunately, the next disaster may be completely different. We work with many organisations where timely access to data is critical, so we see a wide range of risks and recovery strategies. Here’s what we recommend you consider in your business continuity and disaster recovery plans.

Know your why

If you doubled your IT budget and had a second copy of everything (software, data and hardware), you’d have an amazing recovery time. That’s usually not feasible, so you need to pick and choose what recovery resources are appropriate for your business. You make those choices by coming back to your why – why do you need a fast recovery time? What’s the financial impact on your business if an IT system is down? And how long can your business survive without access to that data?

Though it’s inconvenient, a museum may handle an email outage of a few days. A financial institution, however, is losing money every minute that customers can’t transact, or damaging its reputation if its Prime Brokerage operation can’t process trades. This influences your Recovery Time Objective (RTO) – the required timeframe for restoring access to the IT system.

Next, it’s important to confirm how much data you can safely lose, if any. Ideally that would be zero, but it’s not cost effective to back up a museum’s emails every minute, and the museum could get away with losing a few if they couldn’t be restored. Try telling a bank that they’ve lost the last hour of trading data, or that there are no records of any customer withdrawals in the last 30 minutes. This is your Recovery Point Objective (RPO) – what is the maximum age the data in your restored system can be for your business to operate? If the RPO is near zero, you cannot operate your business unless the data is current right up to the point of failure.

These two factors will drive the decisions you make regarding your business continuity plan and your disaster recovery plan. Your RTO and RPO can differ, because one measures downtime and the other measures data loss. In the case of the Australian Taxation Office, an outage of its online services continued for days, but the data was recovered back to the time of the initial failure. They will have had a near-zero RPO.
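To make the difference concrete, here is a minimal Python sketch that checks a single outage against both objectives. The names (RecoveryObjectives, assess_outage) and every figure are assumptions made up for illustration; they are not taken from any real incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # longest acceptable time to restore service
    rpo: timedelta  # largest acceptable window of lost data


def assess_outage(objectives, failed_at, restored_at, last_recoverable_at):
    """Compare one outage with the objectives (all inputs are hypothetical)."""
    downtime = restored_at - failed_at
    data_loss = failed_at - last_recoverable_at
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Hypothetical example: service is down for two days, but no data is lost.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=5))
failed = datetime(2024, 1, 10, 9, 0)
result = assess_outage(
    objectives,
    failed_at=failed,
    restored_at=failed + timedelta(days=2),
    last_recoverable_at=failed,
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this made-up case the RPO is met while the RTO is missed, which is the same pattern as an outage that lasts days but restores data to the point of failure.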

Know your risks

Even companies with large IT budgets and resilient systems can still experience outages. The Australian Securities Exchange was recently hit with an unprecedented hardware failure – a malfunction the hardware vendor had never seen before.

Does the age of your server hardware mean that it’s more likely to fail, or that it will be harder to source compatible replacement parts?
Are you reliant on a single fibre optic Internet connection at one server location?

These would both extend the time it takes to restore usual business operations – no good if you have a short RTO.

Do you know if your external facing systems are hardened against the latest threats?
Do staff know not to click on random email links and do you have security systems in place to protect you when they do? What’s the process if someone confesses they’ve handed over their account details to a phishing scam?

These vulnerabilities may impact the amount of data you will be able to recover, which should be judged against your RPO.

Spend the time documenting your risks and any control measures that you can put in place now, to reduce the likelihood of them occurring or the impact when they do.
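One simple way to document risks is a register that scores each risk by likelihood and impact. The sketch below is only an illustration: the 1 to 5 scales, the scores and the control measures are all assumptions, and your own register should reflect your environment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain), an assumed scale
    impact: int      # 1 (minor) to 5 (severe), an assumed scale
    control: str     # measure that reduces likelihood or impact

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def rank_risks(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries based on the questions above.
register = [
    Risk("Ageing server hardware fails", 3, 4, "Hold spare parts; plan replacement"),
    Risk("Single Internet link is cut", 2, 5, "Add a second link from another carrier"),
    Risk("Staff account lost to phishing", 4, 3, "Awareness training; multi-factor authentication"),
]

for risk in rank_risks(register):
    print(f"{risk.score:>2}  {risk.description} -> {risk.control}")
```

Ranking by score is a rough guide to where control measures should go first; it does not replace judging each risk against your RTO and RPO.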

Know your plan (and work it)

Your business continuity plan and disaster recovery plan don’t just kick in when a system goes down. They consist of pro-active steps and reactive responses and they need testing and updating regularly. At a minimum, they should include:

24×7 application monitoring – Not just system uptime. Are you monitoring database environments for changes and application performance and health?
Documented failover and restore processes – Ensure no critical steps are missed, especially during those sleep-deprived, middle-of-the-night outages.
Regular multi-level restore testing – From restoring an entire server to bringing back a single database instance without impacting other live data.
Communications plan – Document who needs to be notified, and when, throughout the outage, including escalation processes.
Change control – Add your disaster recovery plan to your regular change control process. Assess approved changes to confirm whether they affect your disaster recovery plan.
BCP and DR role playing – Six-monthly or annual disaster scenarios capture the impact of changes in people and your environment. “Kill off” experienced staff to see how the rest of the team fares with only the documentation to follow.
Consider the Cloud – Investigate database backup to the Cloud as another recovery option. With SQL Server, this can be combined with on-premises Availability Groups.
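As one small example of monitoring beyond uptime, the sketch below flags databases whose most recent backup is older than the RPO. It is a minimal illustration under stated assumptions: the database names and timestamps are hypothetical, and in practice the times would come from your own backup tooling.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup_times, rpo, now):
    """Return the names of databases whose newest backup is older than the RPO."""
    return sorted(
        name
        for name, finished_at in last_backup_times.items()
        if now - finished_at > rpo
    )


# Hypothetical data; real timestamps would come from your backup tooling.
now = datetime(2024, 1, 10, 12, 0)
last_backup_times = {
    "trading": now - timedelta(minutes=3),
    "reporting": now - timedelta(hours=2),
    "email": now - timedelta(minutes=50),
}

print(stale_backups(last_backup_times, rpo=timedelta(minutes=15), now=now))
# ['email', 'reporting']
```

A check like this only confirms that backups are recent. It does not prove they can be restored, which is why regular restore testing stays on the list.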

Conclusion

There’s a lot to consider to ensure that your systems are available and functioning well, before you’ve even addressed recovery from a total failure. Knowing your risks, implementing your control measures and constantly monitoring your environment are all key components. Unfortunately, when IT staff are distracted by daily issues, the pro-active controls can slip. Consider third-party options and services like WARDY IT Solutions’ Virtual DBA service, to provide extra coverage, support and experience to your existing IT team. For MediaRadar, it meant a 72-hour recovery time was reduced to 1.5 hours.

The success of your recovery will be greatly impacted by the planning you do now.