As Storm Ophelia batters Ireland, we decided to test our disaster plans for a total failure of country-to-country communication infrastructure.
This is a semi-technical article explaining what we set out to achieve, how we carried out the test, and what we found.
The Test Plan
The plan was to simulate the loss of Ireland from our cluster facilities, and to monitor our clustered applications, such as Case Centre Pro and Spam Safe Mail, while the node was out.
Implementing the disaster plan
Yesterday, we took node 1 in our cluster out of action by altering a setting in the program that checks whether Ireland is visible and available. With the “Node Active” field set to zero, no checks are made. When the cluster expires the old check values, the other nodes in the cluster conclude that node 1 (in Dublin City West) has failed. They then vote to update DNS based on this failure, effectively re-routing client requests to Manchester1, Manchester2 and Shropshire. All of this happens within about a minute, and during that time any modern browser simply uses another node, so the user sees at most a “sluggish” delay. Once DNS has changed, usually within another 3 to 5 minutes, users see no further delay.
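For readers who like to see the idea in code, the failover logic described above can be sketched roughly as follows. This is a minimal illustration and not our production code: the 60-second expiry, the simple majority rule, and every name in it (`PeerView`, `vote_alive`, `dns_targets`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

CHECK_TTL_SECONDS = 60  # assumed expiry for a check value; the real figure may differ


@dataclass
class PeerView:
    """One node's record of when each node last passed a check (hypothetical)."""
    name: str
    last_seen: dict = field(default_factory=dict)

    def thinks_alive(self, node, now):
        # A node counts as alive only while its last check value has not expired.
        seen = self.last_seen.get(node)
        return seen is not None and now - seen <= CHECK_TTL_SECONDS


def vote_alive(node, voters, now):
    """Assumed rule: a strict majority of the other nodes must see the node as alive."""
    yes = sum(1 for voter in voters if voter.thinks_alive(node, now))
    return yes * 2 > len(voters)


def dns_targets(all_nodes, views, now):
    """Return the nodes that DNS should point at after the vote."""
    targets = []
    for node in all_nodes:
        voters = [view for view in views if view.name != node]
        if vote_alive(node, voters, now):
            targets.append(node)
    return targets


nodes = ["DublinCityWest", "Manchester1", "Manchester2", "Shropshire"]
# Checks on Dublin stopped at t=0; the other nodes were last checked at t=90.
seen = {"DublinCityWest": 0, "Manchester1": 90, "Manchester2": 90, "Shropshire": 90}
views = [PeerView(name, dict(seen)) for name in nodes]
print(dns_targets(nodes, views, now=100))
```

In this sketch, once the Dublin check value is older than the expiry, every other node votes it out and only the three remaining nodes are returned as DNS targets.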
Cluster Application Monitoring
The three remaining nodes have maintained uninterrupted service, with no problems and no delays. This was a simulated loss of a node with no loss of service, and clients have not noticed.
Reversion
So, this morning, we woke up running on only three of our nodes. We had a leisurely breakfast and were then ready to revert to the full complement of nodes in the cluster.
This has simply been a matter of changing that zero, described above, back to a one. The checks start being made again, the nodes in the cluster then vote that node back in, and DNS is updated automatically.
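The effect of that single flag can be illustrated with a short sketch. It assumes a checker that skips any node whose “Node Active” value is zero and records a fresh timestamp for every node that passes. The function name `run_checks`, the `probe` callback, and the node labels are hypothetical and only stand in for the real program.

```python
def run_checks(node_active, probe, last_seen, now):
    """Refresh the check value of every active node that responds.

    node_active maps a node name to its "Node Active" flag (0 or 1).
    probe is any callable that returns True when the node is reachable.
    """
    for node, active in node_active.items():
        if not active:
            continue  # flag is zero: no check is made, so the old value ages out
        if probe(node):
            last_seen[node] = now
    return last_seen


def always_up(node):
    # Stand-in probe for the example; a real probe would test the network.
    return True


flags = {"DublinCityWest": 0, "Manchester1": 1}
seen = run_checks(flags, always_up, {}, now=10)   # Dublin is skipped

flags["DublinCityWest"] = 1                        # change the zero back to a one
seen = run_checks(flags, always_up, seen, now=20)  # Dublin is checked again
print(seen)
```

With the flag back at one, the node gets a fresh check value on the next pass, which is what allows the other nodes to vote it back in.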
Post Plan Findings
The test shows that we have the infrastructure and resources to keep our applications available even during a natural disaster such as a storm or hurricane.