Most companies prioritise restoring IT in the event of a system breakdown. What they do not focus on, however, is the processes that ensure the business can continue when automated or digital processes fail, and specifically the role that employees play, as they are ultimately the custodians of the processes that drive operations. It’s easy, for example, to provide a company with 10 seats from which to restore its IT systems and get them up and running again, but how do you accommodate a company with 100 employees who have just lost their premises in a disaster? This poses a different challenge, and not many disaster recovery providers in South Africa have the luxury of that amount of space standing empty, waiting to be occupied when the need arises.
Although most companies are going the route of digitisation, manual processes still have a fundamental role to play. Take an airline, for example. If its electronic system for checking passengers onto the plane goes down, it must have a manual back-office process in place to perform this function. It cannot just ground the aircraft until the electronic system is restored. And herein lies the challenge: not many companies have these contingencies in place, and they are putting themselves, their businesses and, most importantly, their customers at risk.
Where organisations do have these failover processes in place, they either do not test them regularly enough or their testing practices are inadequate. Many perform only a paper-based test: they see that there is a manual process in place, that the configuration is there and that it is documented, but that is where it ends. There is no actual end-to-end testing, which means recovering onto a piece of hardware and making sure that it works, that the network is connected and that users can actually sign in and check the data. People tend to do disaster recovery tests to satisfy their auditors rather than to make sure the business can continue to run in the event of a disaster.
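To make the difference between a paper-based test and an end-to-end test concrete, here is a minimal sketch in Python. It is an illustration only: the step names follow the paragraph above, while the `Step` class, the `run_end_to_end_test` function and the placeholder checks are hypothetical and would need to be replaced by real checks against the recovery environment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True only if the step really succeeded


def run_end_to_end_test(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run each step in order and stop at the first failure.

    Later steps depend on earlier ones (users cannot sign in without a
    network), so carrying on past a failure would give misleading results.
    """
    log = []
    for step in steps:
        ok = step.check()
        log.append(f"{step.name}: {'PASS' if ok else 'FAIL'}")
        if not ok:
            return False, log
    return True, log


# Hypothetical stand-ins: each lambda represents a real check that someone
# performs on the recovered system, not a tick on a document.
steps = [
    Step("restore onto recovery hardware", lambda: True),
    Step("network connected", lambda: True),
    Step("users can sign in", lambda: False),
    Step("data verified by users", lambda: True),
]

passed, log = run_end_to_end_test(steps)
print(passed)
for line in log:
    print(line)
```

In this made-up run the test fails at the sign-in step, which is exactly the kind of gap a paper-based review of the documentation would not reveal.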
There are a number of challenges in adopting an adequate disaster recovery strategy. The biggest challenge is the cost. You know you have to have it, but also that you might never need it. The second challenge is distance. How far from your main site should your disaster recovery site be, particularly when you take into account incidents that could affect a broader geographical area? Here connectivity also comes into play, because the further away your disaster recovery site is from your main site, the more expensive network connectivity becomes.
Possibly one of the biggest risks companies face is that, while they have disaster recovery processes in place, they tend to set them up on equipment that has become redundant or obsolete. In these cases companies have had to upgrade their equipment, so they use the new technology for production and then run their disaster recovery on the old machines. The problem is that when they do need to recover, they find that the old equipment is no longer compatible or supported, which means they cannot recover core systems within a reasonable timeframe.
Disaster recovery often does not get the attention it deserves because it is an expenditure that is not really productive. That is why there is a trend for companies to outsource their disaster recovery to a third party, under an agreement that obliges the provider to have the necessary equipment in place to run the company’s disaster recovery effectively and efficiently.
Companies that are either revisiting their disaster recovery strategy or implementing one for the first time need, as a first step, to understand which of their applications are the most critical. Some applications don’t need disaster recovery contingency, because you can run your business without them. Interestingly though, between 5 and 7 years ago email wasn’t deemed a high-priority application. Today, it is the first thing companies want to have recovered, because it has become mission critical to the running of their businesses.
Times have certainly changed
Companies must also understand the technology that is involved. You can’t just move a workload from a Unix platform to a Microsoft platform. You must ensure that the work breakdown structures, standard operating procedures and processes are documented, tested and updated at least twice a year. It’s easy to just write a process, file it away in a cupboard and do nothing further with it. It needs to be tested rigorously and on a regular basis. And it’s not just about testing, it’s about change management and fixing problems as and when they arise.
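One way to keep track of the twice-a-year requirement is a simple register of procedures and the date each was last tested. The Python sketch below is illustrative only: the procedure names and dates are invented, and it assumes that "at least twice a year" can be read as "no more than about 183 days between tests".

```python
from datetime import date, timedelta

# Assumption: "at least twice a year" is interpreted as a maximum gap of
# roughly six months (183 days) between tests.
MAX_AGE = timedelta(days=183)


def overdue_procedures(last_tested, today):
    """Return the names of procedures whose last test is older than MAX_AGE,
    or that have never been tested (last test date is None)."""
    overdue = []
    for name, tested_on in last_tested.items():
        if tested_on is None or today - tested_on > MAX_AGE:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical register: procedure name -> date of the last real test.
register = {
    "mail recovery": date(2024, 5, 1),
    "payroll recovery": date(2023, 9, 15),
    "manual check-in": None,  # written, filed away, never tested
}

result = overdue_procedures(register, today=date(2024, 6, 30))
print(result)
```

A procedure that has been documented but never tested is treated as overdue, which matches the point that a process filed away in a cupboard offers no assurance.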
Often change management is the biggest problem in disasters. A disaster frequently happens because something changed and the change request did not notify the disaster recovery process of that change. If your disaster recovery manual is not up to date, it could significantly increase the time it takes to fix the problem.
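The link between change requests and the disaster recovery manual can be checked mechanically. The following Python sketch is a hypothetical illustration, not a description of any particular change management tool: the `ChangeRequest` fields, the reference numbers and the system names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ChangeRequest:
    ref: str                 # hypothetical change reference
    system: str              # system affected by the change
    dr_manual_updated: bool  # was the DR manual revised for this change?


def unreflected_changes(changes: List[ChangeRequest],
                        dr_systems: Set[str]) -> List[str]:
    """Return references of changes that touched a system covered by
    disaster recovery without a matching update to the DR manual."""
    return [
        c.ref
        for c in changes
        if c.system in dr_systems and not c.dr_manual_updated
    ]


# Hypothetical data for illustration.
dr_systems = {"mail", "payroll"}
changes = [
    ChangeRequest("CR-1", "mail", True),
    ChangeRequest("CR-2", "payroll", False),
    ChangeRequest("CR-3", "intranet", False),  # not covered by DR
]

stale = unreflected_changes(changes, dr_systems)
print(stale)
```

Here only CR-2 is flagged: it changed a system that disaster recovery depends on, yet the manual was never brought up to date.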